* vxlan/veth performance issues on net.git + latest kernels
@ 2013-12-03 15:05 Or Gerlitz
  2013-12-03 15:30 ` Eric Dumazet
  2013-12-03 17:12 ` Eric Dumazet

From: Or Gerlitz
To: Eric Dumazet, Alexei Starovoitov, Pravin B Shelar
Cc: David Miller, netdev

I've been chasing lately a performance issue which comes into play when
combining veth and vxlan over a fast Ethernet NIC.

I came across it while working to enable TCP stateless offloads for
vxlan encapsulated traffic in the mlx4 driver, but I can clearly see the
issue without any HW offloads involved, so it is easier to discuss it
that way (no offloads involved).

The setup involves a stacked {veth --> bridge --> vxlan --> IP stack -->
NIC} or {veth --> ovs+vxlan --> IP stack --> NIC} chain.

Basically, in my testbed, which uses iperf over 40Gbs Mellanox NICs,
vxlan traffic goes up to 5-7Gbs for a single session and up to 14Gbs
for multiple sessions, as long as veth isn't involved. Once veth is
used I can't get to > 7-8Gbs, no matter how many sessions are used.
For the time being, I manually took into account the tunneling overhead
and reduced the veth pair MTU by 50 bytes.

Looking at the kernel TCP counters in a {veth --> bridge --> vxlan -->
NIC} configuration, on the client side I see lots of hits for the
following TCP counters (the numbers are just a single sample; I look at
the output of iterative sampling every second, e.g. using
"watch -d -n 1 netstat -st"):

    67092 segments retransmited

    31461 times recovered from packet loss due to SACK data
    Detected reordering 1045142 times using SACK
    436215 fast retransmits
    59966 forward retransmits

Also on the passive side I see hits for the "Quick ack mode was
activated N times" counter; see below a full snapshot of the counters
from both sides.
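[Editor's note: the 50-byte MTU reduction above is exactly the
VXLAN-over-IPv4 encapsulation overhead, assuming no IP options and an
untagged inner Ethernet frame. A quick sanity check:]

```python
# Per-packet overhead added by VXLAN-over-IPv4 encapsulation
# (assumes no IPv4 options and an untagged inner Ethernet frame).
OUTER_IPV4 = 20  # outer IPv4 header
OUTER_UDP = 8    # outer UDP header
VXLAN_HDR = 8    # VXLAN header
INNER_ETH = 14   # inner Ethernet header

overhead = OUTER_IPV4 + OUTER_UDP + VXLAN_HDR + INNER_ETH
print(overhead)         # 50
print(1500 - overhead)  # 1450 -- the MTU set on the veth pair below
```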
Without using veth, e.g. when running in a {vxlan --> NIC} or {bridge -->
vxlan --> NIC} chain, I see hits only for the "recovered from packet
loss due to SACK data" counter and the fast retransmits counter, but
not for the forward retransmits or "Detected reordering N times using
SACK" counters. Also, the quick ack mode counter isn't active on the
passive side.

I tested net.git (3.13-rc2+), 3.12.2 and 3.11.9, and I see the same
problem on all of them. At this point I don't really see a past point
at which to start a bisection. So I hope this counter report can help
shed some light on the nature of the problem and a possible solution;
ideas welcome!!

Without vxlan, these are the Gbs results for 1/2/4 streams over 3.12.2
(the results for net.git are pretty much the same):

18/32/38 NIC
17/30/35 bridge --> NIC
14/23/35 veth --> bridge --> NIC

With vxlan, these are the Gbs results for 1/2/4 streams:

6/12/14 vxlan --> IP --> NIC
5/10/14 bridge --> vxlan --> IP --> NIC
6/7/7   veth --> bridge --> vxlan --> IP --> NIC

Also, the 3.12.2 numbers don't get any better even when adding a ported
version of 82d8189826d5 "veth: extend features to support tunneling" on
top of 3.12.2.

See at the end the sequence of commands I use for the environment.

Or.

--> TCP counters from active side

# netstat -ts
IcmpMsg:
    InType0: 2
    InType8: 1
    OutType0: 1
    OutType3: 4
    OutType8: 2
Tcp:
    189 active connections openings
    4 passive connection openings
    0 failed connection attempts
    0 connection resets received
    4 connections established
    22403193 segments received
    541234150 segments send out
    14248 segments retransmited
    0 bad segments received.
    5 resets sent
UdpLite:
TcpExt:
    2 invalid SYN cookies received
    178 TCP sockets finished time wait in fast timer
    10 delayed acks sent
    Quick ack mode was activated 1 times
    4 packets directly queued to recvmsg prequeue.
    3728 packets directly received from backlog
    2 packets directly received from prequeue
    2524 packets header predicted
    4 packets header predicted and directly queued to user
    19793310 acknowledgments not containing data received
    1216966 predicted acknowledgments
    2130 times recovered from packet loss due to SACK data
    Detected reordering 73 times using FACK
    Detected reordering 11424 times using SACK
    55 congestion windows partially recovered using Hoe heuristic
    TCPDSACKUndo: 457
    2 congestion windows recovered after partial ack
    11498 fast retransmits
    2748 forward retransmits
    2 other TCP timeouts
    TCPLossProbes: 4
    3 DSACKs sent for old packets
    TCPSackShifted: 1037782
    TCPSackMerged: 332827
    TCPSackShiftFallback: 598055
    TCPRcvCoalesce: 380
    TCPOFOQueue: 463
    TCPSpuriousRtxHostQueues: 192
IpExt:
    InNoRoutes: 1
    InMcastPkts: 191
    OutMcastPkts: 28
    InBcastPkts: 25
    InOctets: 1789360097
    OutOctets: 893757758988
    InMcastOctets: 8152
    OutMcastOctets: 3044
    InBcastOctets: 4259
    InNoECTPkts: 30117553

--> TCP counters from passive side

netstat -ts
IcmpMsg:
    InType0: 1
    InType8: 2
    OutType0: 2
    OutType3: 5
    OutType8: 1
Tcp:
    75 active connections openings
    140 passive connection openings
    0 failed connection attempts
    0 connection resets received
    4 connections established
    146888643 segments received
    27430160 segments send out
    0 segments retransmited
    0 bad segments received.
    6 resets sent
UdpLite:
TcpExt:
    3 invalid SYN cookies received
    72 TCP sockets finished time wait in fast timer
    10 delayed acks sent
    3 delayed acks further delayed because of locked socket
    Quick ack mode was activated 13548 times
    4 packets directly queued to recvmsg prequeue.
    2 packets directly received from prequeue
    139384763 packets header predicted
    2 packets header predicted and directly queued to user
    671 acknowledgments not containing data received
    938 predicted acknowledgments
    TCPLossProbes: 2
    TCPLossProbeRecovery: 1
    14 DSACKs sent for old packets
    TCPBacklogDrop: 848
    TCPRcvCoalesce: 118368414
    TCPOFOQueue: 3167879
IpExt:
    InNoRoutes: 1
    InMcastPkts: 184
    OutMcastPkts: 26
    InBcastPkts: 26
    InOctets: 1007364296775
    OutOctets: 2433872888
    InMcastOctets: 6202
    OutMcastOctets: 2888
    InBcastOctets: 4597
    InNoECTPkts: 702313233

client side (node 144)
----------------------

ip link add vxlan42 type vxlan id 42 group 239.0.0.42 ttl 10 dev ethN
ifconfig vxlan42 192.168.42.144/24 up

brctl addbr br-vx
ip link set br-vx up

ifconfig br-vx 192.168.52.144/24 up
brctl addif br-vx vxlan42

ip link add type veth
brctl addif br-vx veth1
ifconfig veth0 192.168.62.144/24 up
ip link set veth1 up

ifconfig veth0 mtu 1450
ifconfig veth1 mtu 1450

server side (node 147)
----------------------

ip link add vxlan42 type vxlan id 42 group 239.0.0.42 ttl 10 dev ethN
ifconfig vxlan42 192.168.42.147/24 up

brctl addbr br-vx
ip link set br-vx up

ifconfig br-vx 192.168.52.147/24 up
brctl addif br-vx vxlan42

ip link add type veth
brctl addif br-vx veth1
ifconfig veth0 192.168.62.147/24 up
ip link set veth1 up

ifconfig veth0 mtu 1450
ifconfig veth1 mtu 1450
* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-03 15:30 ` Eric Dumazet

From: Eric Dumazet
To: Or Gerlitz
Cc: Eric Dumazet, Alexei Starovoitov, Pravin B Shelar, David Miller, netdev

On Tue, 2013-12-03 at 17:05 +0200, Or Gerlitz wrote:
> [...]
>
> --> TCP counters from passiveside
>
> netstat -ts
> [...]
> Tcp:
> 75 active connections openings
> 140 passive connection openings
> 0 failed connection attempts
> 0 connection resets received
> 4 connections established
> 146888643 segments received
> 27430160 segments send out
> 0 segments retransmited
> 0 bad segments received.
> 6 resets sent
> [...]
> TCPLossProbes: 2
> TCPLossProbeRecovery: 1
> 14 DSACKs sent for old packets
> TCPBacklogDrop: 848

That's bad: dropping packets on the receiver.

Check also "ifconfig -a" to see if the rx drop counter is increasing as
well.

> TCPRcvCoalesce: 118368414

Lack of GRO: the receiver seems unable to receive as fast as you want.

> TCPOFOQueue: 3167879

So many packets are received out of order (because of losses).

> [...]
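[Editor's note: Eric's "check ifconfig -a" suggestion can be scripted by
reading /proc/net/dev, whose per-interface line carries the rx
bytes/packets/errs/drop fields before the tx fields. A minimal sketch;
the function name and the sample text are illustrative, not from the
thread:]

```python
def rx_drops(proc_net_dev_text):
    """Return {interface: rx drop count} from /proc/net/dev contents."""
    drops = {}
    for line in proc_net_dev_text.splitlines():
        if ':' not in line:
            continue  # skip the two header lines
        name, stats = line.split(':', 1)
        fields = stats.split()
        drops[name.strip()] = int(fields[3])  # rx "drop" is the 4th field
    return drops

# Made-up sample; real use would be rx_drops(open('/proc/net/dev').read())
sample = (
    "Inter-|   Receive                       |  Transmit\n"
    " face |bytes packets errs drop fifo frame compressed multicast|"
    "bytes packets errs drop fifo colls carrier compressed\n"
    "  eth0: 1847 22 0 848 0 0 0 0 1200 15 0 0 0 0 0 0\n"
)
print(rx_drops(sample))  # {'eth0': 848}
```

Run twice a second apart and diff the dicts to see whether drops are
still climbing while the test runs.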
* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-03 19:55 ` Or Gerlitz

From: Or Gerlitz
To: Eric Dumazet, Jerry Chu
Cc: Or Gerlitz, Eric Dumazet, Alexei Starovoitov, Pravin B Shelar, David Miller, netdev

On Tue, Dec 3, 2013 at 5:30 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2013-12-03 at 17:05 +0200, Or Gerlitz wrote:
>> [...]
>> TCPBacklogDrop: 848
>
> That's bad: dropping packets on the receiver.
>
> Check also "ifconfig -a" to see if the rx drop counter is increasing
> as well.
>
>> TCPRcvCoalesce: 118368414
>
> Lack of GRO: the receiver seems unable to receive as fast as you want.
>
>> TCPOFOQueue: 3167879
>
> So many packets are received out of order (because of losses).

I see that there's no GRO also for the non-veth tests which involve
vxlan, and over there the receiving side is able to consume the
packets. Do you have a rough explanation of why adding veth to the
chain is such a game changer that makes things start falling apart?
* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-03 21:11 ` Joseph Gasparakis

From: Joseph Gasparakis
To: Or Gerlitz
Cc: Eric Dumazet, Jerry Chu, Or Gerlitz, Eric Dumazet, Alexei Starovoitov, Pravin B Shelar, David Miller, netdev

On Tue, 3 Dec 2013, Or Gerlitz wrote:

> On Tue, Dec 3, 2013 at 5:30 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Tue, 2013-12-03 at 17:05 +0200, Or Gerlitz wrote:
> >> [...]
> >> 4 packets directly queued to recvmsg prequeue.
> >> 2 packets directly received from prequeue
> >> [...]
> >> TCPBacklogDrop: 848
> >
> > That's bad: dropping packets on the receiver.
> >
> > Check also "ifconfig -a" to see if the rx drop counter is increasing
> > as well.
> >
> >> TCPRcvCoalesce: 118368414
> >
> > Lack of GRO: the receiver seems unable to receive as fast as you
> > want.
> >
> >> TCPOFOQueue: 3167879
> >
> > So many packets are received out of order (because of losses).
>
> I see that there's no GRO also for the non-veth tests which involve
> vxlan, and over there the receiving side is able to consume the
> packets. Do you have a rough explanation of why adding veth to the
> chain is such a game changer that makes things start falling apart?

I have seen this before. Here are my findings:

The gso_type is different depending on whether the skb comes from veth
or not. From veth, you will see SKB_GSO_DODGY set. This breaks things:
when an skb with DODGY set moves from vxlan to the driver through
dev_hard_start_xmit(), the stack drops it silently. I never got the
time to find the root cause for this, but I know it causes
retransmissions and a big performance degradation.

I went as far as quickly hacking a one-liner unsetting the DODGY bit in
vxlan.c, and that bypassed the issue and recovered the performance, but
obviously this is not a real fix.
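[Editor's note: the hack Joseph describes amounts to a single bitmask on
the skb's gso_type on the vxlan transmit path. A standalone model of
that bit manipulation; the flag values below are illustrative stand-ins
for the kernel's SKB_GSO_* enum, not the real values:]

```python
# Illustrative stand-ins for kernel SKB_GSO_* flags (values are made up).
SKB_GSO_TCPV4 = 1 << 0
SKB_GSO_DODGY = 1 << 2
SKB_GSO_UDP_TUNNEL = 1 << 8

def clear_dodgy(gso_type):
    """The described one-liner: drop only the DODGY bit, keeping the
    segmentation protocol and tunnel flags intact."""
    return gso_type & ~SKB_GSO_DODGY

gso = SKB_GSO_TCPV4 | SKB_GSO_DODGY | SKB_GSO_UDP_TUNNEL
gso = clear_dodgy(gso)
print(hex(gso))  # 0x101 -- TCPV4 and UDP_TUNNEL survive, DODGY is gone
```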
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-03 21:11 ` Joseph Gasparakis @ 2013-12-03 21:09 ` Or Gerlitz 2013-12-03 21:24 ` Eric Dumazet 2013-12-03 23:13 ` vxlan/veth performance issues on net.git + latest kernels Joseph Gasparakis 0 siblings, 2 replies; 63+ messages in thread From: Or Gerlitz @ 2013-12-03 21:09 UTC (permalink / raw) To: Joseph Gasparakis Cc: Eric Dumazet, Jerry Chu, Or Gerlitz, Eric Dumazet, Alexei Starovoitov, Pravin B Shelar, David Miller, netdev On Tue, Dec 3, 2013 at 11:11 PM, Joseph Gasparakis <joseph.gasparakis@intel.com> wrote: >>> lack of GRO : receiver seems to not be able to receive as fast as you want. >>>> TCPOFOQueue: 3167879 >>> So many packets are received out of order (because of losses) >> I see that there's no GRO also for the non-veth tests which involve >> vxlan, and over there the receiving side is capable of consuming the >> packets, do you have a rough explanation why adding veth to the chain >> is such a game changer which makes things start falling apart? > I have seen this before. Here are my findings: > > The gso_type is different depending on whether the skb comes from veth or not. From veth, > you will see SKB_GSO_DODGY set. This breaks things: when the > skb with DODGY set moves from vxlan to the driver through dev_hard_start_xmit, > the stack drops it silently. I never got the time to find the root cause > for this, but I know it causes retransmissions and big performance > degradation. > > I went as far as quickly hacking a one-liner unsetting the DODGY bit > in vxlan.c, and that bypassed the issue and recovered the performance, > but obviously this is not a real fix. thanks for the heads-up, a few quick questions/clarifications -- -- you are talking about drops done @ the sender side, correct? Eric was saying we have evidence that the drops happen on the receiver. -- without the hack you did, packets are still sent/received, so what makes the stack drop only some of them? 
-- why would packets coming from veth have the SKB_GSO_DODGY bit set? -- so where is this one line you commented out now (say in net.git or 3.12.x)? I don't see an explicit setting of SKB_GSO_DODGY in vxlan.c or in ip_tunnel_core.c / ip_tunnel.c. Also, I am pretty sure the problem exists also when sending/receiving guest traffic through tap/macvtap <--> vhost/virtio-net and friends, I just stuck to the veth flavour b/c it's one (== the hypervisor) network stack to debug and not two (+ the guest one). ^ permalink raw reply [flat|nested] 63+ messages in thread
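Joseph's "one liner" above amounts to masking SKB_GSO_DODGY out of skb_shinfo(skb)->gso_type before the skb leaves vxlan. A standalone sketch of that bit manipulation — the flag values match include/linux/skbuff.h of the 3.12 era, but the helper and the sample packet here are illustrative, not kernel code:

```python
# gso_type flag bits as defined in include/linux/skbuff.h (3.12-era values).
SKB_GSO_TCPV4 = 1 << 0
SKB_GSO_UDP = 1 << 1
SKB_GSO_DODGY = 1 << 2  # set when GSO metadata comes from an untrusted path

def clear_dodgy(gso_type: int) -> int:
    """The hack described above: drop the DODGY bit, keep the other GSO bits."""
    return gso_type & ~SKB_GSO_DODGY

# A TSO skb forwarded out of veth carries DODGY alongside its real GSO type.
gso = SKB_GSO_TCPV4 | SKB_GSO_DODGY
print(clear_dodgy(gso) == SKB_GSO_TCPV4)  # True
```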
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-03 21:09 ` Or Gerlitz @ 2013-12-03 21:24 ` Eric Dumazet 2013-12-03 21:36 ` Or Gerlitz 2013-12-03 23:13 ` vxlan/veth performance issues on net.git + latest kernels Joseph Gasparakis 1 sibling, 1 reply; 63+ messages in thread From: Eric Dumazet @ 2013-12-03 21:24 UTC (permalink / raw) To: Or Gerlitz Cc: Joseph Gasparakis, Jerry Chu, Or Gerlitz, Eric Dumazet, Alexei Starovoitov, Pravin B Shelar, David Miller, netdev On Tue, 2013-12-03 at 23:09 +0200, Or Gerlitz wrote: > On Tue, Dec 3, 2013 at 11:11 PM, Joseph Gasparakis > <joseph.gasparakis@intel.com> wrote: > > >>> lack of GRO : receiver seems to not be able to receive as fast as you want. > >>>> TCPOFOQueue: 3167879 > >>> So many packets are received out of order (because of losses) > > >> I see that there's no GRO also for the non-veth tests which involve > >> vxlan, and over there the receiving side is capable to consume the > >> packets, do you have rough explaination why adding veth to the chain > >> is such game changer which makes things to start falling out? > > > I have seen this before. Here are my findings: > > > > The gso_type is different if the skb comes from veth or not. From veth, > > you will see the SKB_GSO_DODGY set. This breaks things as when the > > skb with DODGY set moves from vxlan to the driver through dev_xmit_hard, > > the stack drops it silently. I never got the time to find the root cause > > for this, but I know it causes re-transmissions and big performance > > degregation. > > > > I went as far as just quickly hacking a one liner unsetting the DODGY bit > > in vxlan.c and that bypassed the issue and recovered the performance > > problem, but obviously this is not a real fix. > > thanks for the heads up, few quick questions/clafications -- > > -- you are talking on drops done @ the sender side, correct? Eric was > saying we have evidences that the drops happen on the receiver. 
I suggested you take a look at the receiver, like "ifconfig -a". I suspect one cpu is 100% in softirq mode draining packets from the NIC and feeding the IP / TCP stack. Because of vxlan encap, all the packets are delivered to a single RX queue (I don't think mlx4 is able to look at the inner header to get L4 info) mpstat -P ALL 10 10 ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-03 21:24 ` Eric Dumazet @ 2013-12-03 21:36 ` Or Gerlitz 2013-12-03 21:50 ` David Miller 0 siblings, 1 reply; 63+ messages in thread From: Or Gerlitz @ 2013-12-03 21:36 UTC (permalink / raw) To: Eric Dumazet Cc: Joseph Gasparakis, Jerry Chu, Or Gerlitz, Eric Dumazet, Alexei Starovoitov, Pravin B Shelar, David Miller, netdev On Tue, Dec 3, 2013 at 11:24 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > On Tue, 2013-12-03 at 23:09 +0200, Or Gerlitz wrote: >> On Tue, Dec 3, 2013 at 11:11 PM, Joseph Gasparakis >> <joseph.gasparakis@intel.com> wrote: >> >> >>> lack of GRO : receiver seems to not be able to receive as fast as you want. >> >>>> TCPOFOQueue: 3167879 >> >>> So many packets are received out of order (because of losses) >> >> >> I see that there's no GRO also for the non-veth tests which involve >> >> vxlan, and over there the receiving side is capable to consume the >> >> packets, do you have rough explaination why adding veth to the chain >> >> is such game changer which makes things to start falling out? >> >> > I have seen this before. Here are my findings: >> > >> > The gso_type is different if the skb comes from veth or not. From veth, >> > you will see the SKB_GSO_DODGY set. This breaks things as when the >> > skb with DODGY set moves from vxlan to the driver through dev_xmit_hard, >> > the stack drops it silently. I never got the time to find the root cause >> > for this, but I know it causes re-transmissions and big performance >> > degregation. >> > >> > I went as far as just quickly hacking a one liner unsetting the DODGY bit >> > in vxlan.c and that bypassed the issue and recovered the performance >> > problem, but obviously this is not a real fix. >> >> thanks for the heads up, few quick questions/clafications -- >> >> -- you are talking on drops done @ the sender side, correct? Eric was >> saying we have evidences that the drops happen on the receiver. 
> > I suggested you take a look at the receiver, like "ifconfig -a" Eric, sorry I am away from the system now, will try to get some access and report back now and if not, tomorrow, but > I suspect one cpu is 100% in softirq mode draining packets from the NIC > and feeding the IP / TCP stack. > Because of vxlan encap, all the packets are delivered to a single RX > queue (I don't think mlx4 is able to look at the inner header to get L4 info) With the new card, ConnectX3-pro, we are able to look at inner headers and do RX/TX checksum and LSO for the encapsulated traffic; this is how I initially got into this problem. But as I wrote earlier, I was able to see the problem without activating the offloads for the inner packets. Sorry if I didn't mention that, but from the mlx4_en NIC driver point of view, different streams do map to different RX queues, b/c the HW does RSS on the outer (UDP) header and the sender vxlan code uses a few sockets to send multiple streams, each having a different source UDP port. For the "outer RSS" you don't need the -pro card, just make sure the udp_rss module param of mlx4 is set. I also thought that under veth there's a contention point which could explain why packets are dropped, but haven't found it. > mpstat -P ALL 10 10 ^ permalink raw reply [flat|nested] 63+ messages in thread
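The outer-RSS behaviour Or describes can be sketched in a few lines: each stream gets its own outer UDP source port from the vxlan sender, so a NIC that hashes only the outer 4-tuple still spreads the streams across RX queues. The hash below is a stand-in for the NIC's Toeplitz RSS function, and the addresses and ports are illustrative (4789 is the IANA VXLAN port; kernels of this era defaulted to 8472):

```python
import zlib

def rx_queue(src_ip: str, src_port: int, dst_ip: str, dst_port: int,
             num_queues: int = 8) -> int:
    # Stand-in for the NIC's RSS hash over the *outer* UDP 4-tuple.
    key = f"{src_ip}:{src_port}:{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % num_queues

# Four streams between the same endpoints, differing only in the outer
# source port the vxlan sender picked per inner flow.
queues = [rx_queue("192.168.30.144", sport, "192.168.30.147", 4789)
          for sport in (49152, 49153, 49154, 49155)]
print(queues)  # distinct outer ports let the streams land on different queues
```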
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-03 21:36 ` Or Gerlitz @ 2013-12-03 21:50 ` David Miller 2013-12-03 21:55 ` Eric Dumazet 0 siblings, 1 reply; 63+ messages in thread From: David Miller @ 2013-12-03 21:50 UTC (permalink / raw) To: or.gerlitz Cc: eric.dumazet, joseph.gasparakis, hkchu, ogerlitz, edumazet, ast, pshelar, netdev From: Or Gerlitz <or.gerlitz@gmail.com> Date: Tue, 3 Dec 2013 23:36:50 +0200 > I also thought that under veth there's contention point which could > explain why packets are dropped, but haven't found it. At this point I would use drop monitor to figure out in what context packets are being dropped on the floor. There are scripts provided with the perf tool to utilize it. I suspect what you will find is that either the cpu is at %100, or sporadic events where GRO is not performed is killing the stream. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-03 21:50 ` David Miller @ 2013-12-03 21:55 ` Eric Dumazet 2013-12-03 22:15 ` Or Gerlitz ` (2 more replies) 0 siblings, 3 replies; 63+ messages in thread From: Eric Dumazet @ 2013-12-03 21:55 UTC (permalink / raw) To: David Miller Cc: or.gerlitz, joseph.gasparakis, hkchu, ogerlitz, edumazet, ast, pshelar, netdev On Tue, 2013-12-03 at 16:50 -0500, David Miller wrote: > At this point I would use drop monitor to figure out in what context > packets are being dropped on the floor. There are scripts provided > with the perf tool to utilize it. Most easy way is to do : perf record -e skb:kfree_skb -a -g sleep 10 perf report ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-03 21:55 ` Eric Dumazet @ 2013-12-03 22:15 ` Or Gerlitz 2013-12-03 22:22 ` Or Gerlitz 2013-12-03 23:10 ` Or Gerlitz 2 siblings, 0 replies; 63+ messages in thread From: Or Gerlitz @ 2013-12-03 22:15 UTC (permalink / raw) To: Eric Dumazet Cc: David Miller, Joseph Gasparakis, Jerry Chu, Or Gerlitz, Eric Dumazet, Alexei Starovoitov, Pravin B Shelar, netdev@vger.kernel.org On Tue, Dec 3, 2013 at 11:55 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > On Tue, 2013-12-03 at 16:50 -0500, David Miller wrote: > >> At this point I would use drop monitor to figure out in what context >> packets are being dropped on the floor. There are scripts provided >> with the perf tool to utilize it. > > Most easy way is to do : > > perf record -e skb:kfree_skb -a -g sleep 10 > > perf report The version of perf I have on these nodes fail to run the 1st command, anyway, here's some data which was asked by Eric passive side top + plain perf for two streams top - 00:08:09 up 7:53, 3 users, load average: 0.59, 0.43, 0.32 Tasks: 134 total, 1 running, 133 sleeping, 0 stopped, 0 zombie Cpu0 : 0.0%us, 5.7%sy, 0.0%ni, 92.2%id, 0.0%wa, 0.0%hi, 2.0%si, 0.0%st Cpu1 : 0.6%us, 17.2%sy, 0.0%ni, 30.6%id, 0.0%wa, 0.0%hi, 51.7%si, 0.0%st Cpu2 : 0.7%us, 4.3%sy, 0.0%ni, 93.6%id, 0.0%wa, 0.0%hi, 1.3%si, 0.0%st Cpu3 : 0.0%us, 1.7%sy, 0.0%ni, 98.0%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st Cpu4 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu5 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu6 : 0.0%us, 14.4%sy, 0.0%ni, 50.0%id, 0.0%wa, 0.0%hi, 35.6%si, 0.0%st Cpu7 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 8183220k total, 1383012k used, 6800208k free, 10724k buffers Swap: 2097148k total, 0k used, 2097148k free, 127520k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 3857 root 20 0 381m 640 468 S 69.2 0.0 6:45.22 iperf 15 root 20 0 0 0 0 S 7.3 0.0 0:09.92 ksoftirqd/1 
40 root 20 0 0 0 0 S 7.3 0.0 0:12.10 ksoftirqd/6 20 root 20 0 0 0 0 S 0.3 0.0 0:20.95 ksoftirqd/2 9229 root 20 0 220m 16m 6168 S 0.3 0.2 0:00.78 perf 9578 root 20 0 15084 1156 856 R 0.3 0.0 0:00.01 top 1 root 20 0 23648 1644 1320 S 0.0 0.0 0:00.34 init Samples: 1K of event 'cpu-clock', Event count (approx.): 283500000 5.17% [kernel] [k] fib_table_lookup 5.02% [kernel] [k] __do_softirq 3.63% [kernel] [k] copy_user_generic_unrolled 3.55% perf [.] 0x000000000004c808 3.16% [kernel] [k] __netif_receive_skb_core 3.09% [kernel] [k] _raw_spin_unlock_irqrestore 2.93% [kernel] [k] enqueue_to_backlog 2.85% [kernel] [k] _raw_spin_lock 2.08% [kernel] [k] __pskb_pull_tail 1.93% [kernel] [k] __udp4_lib_lookup 1.85% [kernel] [k] ip_rcv 1.77% [kernel] [k] __slab_free 1.62% [kernel] [k] _raw_spin_unlock_irq 1.54% [kernel] [k] check_leaf 1.47% [kernel] [k] pvclock_clocksource_read 1.47% [kernel] [k] skb_copy_bits 1.31% [kernel] [k] __rcu_read_unlock 1.31% [kernel] [k] tcp_v4_rcv 1.23% [mlx4_en] [k] mlx4_en_process_rx_cq 1.00% [kernel] [k] skb_try_coalesce 1.00% [kernel] [k] napi_gro_frags 1.00% [kernel] [k] __inet_lookup_established 0.93% [mlx4_en] [k] mlx4_en_xmit ifconfig -a listing (eth2 is the NIC over which we run, br1 is the bridge) - no recorded drops [root@r-dcs47-005 ~]# ifconfig -a br1 Link encap:Ethernet HWaddr 1A:10:63:AD:55:4C inet addr:192.168.52.147 Bcast:192.168.52.255 Mask:255.255.255.0 inet6 addr: fe80::f88c:60ff:fe19:6d33/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1450 Metric:1 RX packets:32120731 errors:0 dropped:0 overruns:0 frame:0 TX packets:987550 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:46552789260 (43.3 GiB) TX bytes:53462960 (50.9 MiB) eth0 Link encap:Ethernet HWaddr 00:50:56:25:4A:05 inet addr:10.212.74.5 Bcast:10.212.255.255 Mask:255.255.0.0 inet6 addr: fe80::250:56ff:fe25:4a05/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:265478 errors:0 dropped:16 overruns:0 frame:0 TX packets:14391 
errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:28861836 (27.5 MiB) TX bytes:2561303 (2.4 MiB) eth2 Link encap:Ethernet HWaddr 00:02:C9:E9:C0:82 inet addr:192.168.30.147 Bcast:192.168.30.255 Mask:255.255.255.0 inet6 addr: fe80::2:c900:1e9:c082/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:853813090 errors:0 dropped:0 overruns:0 frame:0 TX packets:76377493 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:1270293325678 (1.1 TiB) TX bytes:7858334980 (7.3 GiB) lo Link encap:Local Loopback inet addr:127.0.0.1 Mask:255.0.0.0 inet6 addr: ::1/128 Scope:Host UP LOOPBACK RUNNING MTU:65536 Metric:1 RX packets:48 errors:0 dropped:0 overruns:0 frame:0 TX packets:48 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:3236 (3.1 KiB) TX bytes:3236 (3.1 KiB) veth0 Link encap:Ethernet HWaddr EA:4F:C9:1C:5D:EE inet addr:192.168.62.147 Bcast:192.168.62.255 Mask:255.255.255.0 inet6 addr: fe80::e84f:c9ff:fe1c:5dee/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1450 Metric:1 RX packets:372217768 errors:0 dropped:0 overruns:0 frame:0 TX packets:60732630 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:544449586684 (507.0 GiB) TX bytes:3840742148 (3.5 GiB) veth1 Link encap:Ethernet HWaddr 1A:10:63:AD:55:4C inet6 addr: fe80::1810:63ff:fead:554c/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1450 Metric:1 RX packets:60732693 errors:0 dropped:0 overruns:0 frame:0 TX packets:372217836 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:3840746210 (3.5 GiB) TX bytes:544449686236 (507.0 GiB) vxlan42 Link encap:Ethernet HWaddr FE:EF:4E:C7:0F:06 inet addr:192.168.42.147 Bcast:192.168.42.255 Mask:255.255.255.0 inet6 addr: fe80::fcef:4eff:fec7:f06/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1450 Metric:1 RX packets:404338687 errors:0 dropped:0 overruns:0 frame:0 TX packets:61720259 errors:0 dropped:0 overruns:0 carrier:0 
collisions:0 txqueuelen:0 RX bytes:585791638636 (545.5 GiB) TX bytes:4881734478 (4.5 GiB) active side perf top Samples: 74K of event 'cpu-clock', Event count (approx.): 12653268255 14.74% [kernel] [k] __copy_user_nocache 7.91% [kernel] [k] csum_partial 7.51% [kernel] [k] _raw_spin_lock 6.38% [kernel] [k] _raw_spin_unlock_irqrestore 5.57% [kernel] [k] __do_softirq 4.24% [mlx4_en] [k] mlx4_en_xmit 2.40% [kernel] [k] __copy_skb_header 2.10% [kernel] [k] _raw_spin_unlock_irq 2.04% [kernel] [k] memcpy 1.92% [kernel] [k] fib_table_lookup 1.73% [kernel] [k] tcp_sendmsg 1.64% [kernel] [k] skb_segment 1.52% [kernel] [k] __slab_free 1.09% [kernel] [k] __alloc_skb 0.89% [kernel] [k] __slab_alloc 0.85% [kernel] [k] tcp_ack 0.83% [kernel] [k] __netif_receive_skb_core 0.81% [kernel] [k] __kmalloc_node_track_caller 0.75% [kernel] [k] ip_send_check 0.70% [kernel] [k] put_compound_page 0.70% [kernel] [k] ksize 0.67% [kernel] [k] pvclock_clocksource_read 0.65% [kernel] [k] dev_hard_start_xmit 0.61% [kernel] [k] kmem_cache_alloc_node 0.58% [kernel] [k] dev_queue_xmit 0.55% [kernel] [k] enqueue_to_backlog 0.52% [kernel] [k] __pskb_pull_tail 0.49% [kernel] [k] __iowrite64_copy 0.45% [kernel] [k] dev_queue_xmit_nit 0.45% [kernel] [k] skb_copy_bits 0.44% [kernel] [k] check_leaf 0.44% [kernel] [k] skb_release_data 0.44% [kernel] [k] get_page_from_freelist 0.43% [kernel] [k] inet_gso_segment 0.42% [mlx4_en] [k] mlx4_en_process_rx_cq 0.39% [kernel] [k] process_backlog 0.38% [kernel] [k] pskb_expand_head and top [root@r-dcs44-005 ~]# top top - 00:13:27 up 7:59, 3 users, load average: 1.11, 0.76, 0.44 Tasks: 129 total, 1 running, 128 sleeping, 0 stopped, 0 zombie Cpu0 : 0.0%us, 10.3%sy, 0.0%ni, 87.0%id, 0.0%wa, 0.0%hi, 2.7%si, 0.0%st Cpu1 : 0.3%us, 11.7%sy, 0.0%ni, 84.3%id, 0.0%wa, 0.0%hi, 3.7%si, 0.0%st Cpu2 : 0.5%us, 18.4%sy, 0.0%ni, 43.4%id, 0.0%wa, 0.0%hi, 37.8%si, 0.0%st Cpu3 : 0.0%us, 5.1%sy, 0.0%ni, 94.3%id, 0.0%wa, 0.0%hi, 0.7%si, 0.0%st Cpu4 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 
0.0%hi, 0.0%si, 0.0%st Cpu5 : 0.3%us, 6.0%sy, 0.0%ni, 93.4%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st Cpu6 : 0.3%us, 7.4%sy, 0.0%ni, 92.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu7 : 0.0%us, 24.4%sy, 0.0%ni, 42.1%id, 0.0%wa, 0.0%hi, 33.5%si, 0.0%st Mem: 8183236k total, 1378928k used, 6804308k free, 10436k buffers Swap: 2097148k total, 0k used, 2097148k free, 128068k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 17220 root 20 0 165m 564 456 S 122.4 0.0 6:14.34 iperf 20 root 20 0 0 0 0 S 1.3 0.0 0:07.85 ksoftirqd/2 45 root 20 0 0 0 0 S 1.3 0.0 0:05.65 ksoftirqd/7 18404 root 20 0 220m 16m 6384 S 0.7 0.2 0:00.62 perf 35 root 20 0 0 0 0 S 0.3 0.0 0:07.73 ksoftirqd/5 1 root 20 0 23520 1624 1316 S 0.0 0.0 0:00.33 init 2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd and ifconfig -a (eth6 is the NIC over which we run) root@r-dcs44-005 ~]# ifconfig -a br1 Link encap:Ethernet HWaddr 12:C0:46:32:46:6A inet addr:192.168.52.144 Bcast:192.168.52.255 Mask:255.255.255.0 inet6 addr: fe80::2815:30ff:fe06:b5f9/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1450 Metric:1 RX packets:987575 errors:0 dropped:0 overruns:0 frame:0 TX packets:842756 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:39638896 (37.8 MiB) TX bytes:45313678898 (42.2 GiB) eth0 Link encap:Ethernet HWaddr 00:50:56:25:4B:05 inet addr:10.212.75.5 Bcast:10.212.255.255 Mask:255.255.0.0 inet6 addr: fe80::250:56ff:fe25:4b05/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:193220 errors:0 dropped:55 overruns:0 frame:0 TX packets:17721 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:17006297 (16.2 MiB) TX bytes:2871741 (2.7 MiB) eth6 Link encap:Ethernet HWaddr 00:02:C9:E9:BB:B2 inet addr:192.168.30.144 Bcast:192.168.30.255 Mask:255.255.255.0 UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:101512678 errors:0 dropped:0 overruns:0 frame:0 TX packets:995876432 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX 
bytes:10707213944 (9.9 GiB) TX bytes:1485292875970 (1.3 TiB) lo Link encap:Local Loopback inet addr:127.0.0.1 Mask:255.0.0.0 inet6 addr: ::1/128 Scope:Host UP LOOPBACK RUNNING MTU:65536 Metric:1 RX packets:36 errors:0 dropped:0 overruns:0 frame:0 TX packets:36 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:2456 (2.3 KiB) TX bytes:2456 (2.3 KiB) veth0 Link encap:Ethernet HWaddr 32:7D:34:FA:3A:A1 inet addr:192.168.62.144 Bcast:192.168.62.255 Mask:255.255.255.0 inet6 addr: fe80::307d:34ff:fefa:3aa1/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1450 Metric:1 RX packets:85847361 errors:0 dropped:0 overruns:0 frame:0 TX packets:72371900 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:5431514098 (5.0 GiB) TX bytes:728410479216 (678.3 GiB) veth1 Link encap:Ethernet HWaddr 12:C0:46:32:46:6A inet6 addr: fe80::10c0:46ff:fe32:466a/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1450 Metric:1 RX packets:72371915 errors:0 dropped:0 overruns:0 frame:0 TX packets:85847368 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:728410821246 (678.3 GiB) TX bytes:5431514476 (5.0 GiB) vxlan42 Link encap:Ethernet HWaddr B2:F9:D2:68:A3:11 inet addr:192.168.42.144 Bcast:192.168.42.255 Mask:255.255.255.0 inet6 addr: fe80::b0f9:d2ff:fe68:a311/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1450 Metric:1 RX packets:86834927 errors:0 dropped:0 overruns:0 frame:0 TX packets:73214688 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:4269288656 (3.9 GiB) TX bytes:774896089976 (721.6 GiB) no drops ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-03 21:55 ` Eric Dumazet 2013-12-03 22:15 ` Or Gerlitz @ 2013-12-03 22:22 ` Or Gerlitz 2013-12-03 22:30 ` Hannes Frederic Sowa 2013-12-03 23:10 ` Or Gerlitz 2 siblings, 1 reply; 63+ messages in thread From: Or Gerlitz @ 2013-12-03 22:22 UTC (permalink / raw) To: Eric Dumazet Cc: David Miller, Joseph Gasparakis, Jerry Chu, Or Gerlitz, Eric Dumazet, Alexei Starovoitov, Pravin B Shelar, netdev@vger.kernel.org On Tue, Dec 3, 2013 at 11:55 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > On Tue, 2013-12-03 at 16:50 -0500, David Miller wrote: > >> At this point I would use drop monitor to figure out in what context >> packets are being dropped on the floor. There are scripts provided >> with the perf tool to utilize it. > > Most easy way is to do : > > perf record -e skb:kfree_skb -a -g sleep 10 some typo here? I tried the perf tool that comes with the net git ./perf record -e skb:kfree_skb -a -g sleep 10 invalid or unsupported event: 'skb:kfree_skb' Run 'perf list' for a list of valid events usage: perf record [<options>] [<command>] or: perf record [<options>] -- <command> [<options>] -e, --event <event> event selector. use 'perf list' to list available events > > perf report > > > anything critical missing here? perf]# make BUILD: Doing 'make -j16' parallel build Auto-detecting system features: ... backtrace: [ on ] ... dwarf: [ on ] ... fortify-source: [ on ] ... glibc: [ on ] ... gtk2: [ OFF ] ... gtk2-infobar: [ OFF ] ... libaudit: [ OFF ] ... libbfd: [ on ] ... libelf: [ on ] ... libelf-getphdrnum: [ on ] ... libelf-mmap: [ on ] ... libnuma: [ OFF ] ... libperl: [ on ] ... libpython: [ on ] ... libpython-version: [ on ] ... libslang: [ on ] ... libunwind: [ OFF ] ... on-exit: [ on ] ... stackprotector: [ on ] ... stackprotector-all: [ on ] ... timerfd: [ on ] config/Makefile:329: No libunwind found, disabling post unwind support. 
Please install libunwind-dev[el] >= 1.1 config/Makefile:354: No libaudit.h found, disables 'trace' tool, please install audit-libs-devel or libaudit-dev config/Makefile:381: GTK2 not found, disables GTK2 support. Please install gtk2-devel or libgtk2.0-dev config/Makefile:536: No numa.h found, disables 'perf bench numa mem' benchmark, please install numa-libs-devel or libnuma-dev ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-03 22:22 ` Or Gerlitz @ 2013-12-03 22:30 ` Hannes Frederic Sowa 2013-12-03 22:35 ` Or Gerlitz 0 siblings, 1 reply; 63+ messages in thread From: Hannes Frederic Sowa @ 2013-12-03 22:30 UTC (permalink / raw) To: Or Gerlitz Cc: Eric Dumazet, David Miller, Joseph Gasparakis, Jerry Chu, Or Gerlitz, Eric Dumazet, Alexei Starovoitov, Pravin B Shelar, netdev@vger.kernel.org On Wed, Dec 04, 2013 at 12:22:08AM +0200, Or Gerlitz wrote: > On Tue, Dec 3, 2013 at 11:55 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > > On Tue, 2013-12-03 at 16:50 -0500, David Miller wrote: > > > >> At this point I would use drop monitor to figure out in what context > >> packets are being dropped on the floor. There are scripts provided > >> with the perf tool to utilize it. > > > > Most easy way is to do : > > > > perf record -e skb:kfree_skb -a -g sleep 10 > > some typo here? I tried the perf tool that comes with the net git > > ./perf record -e skb:kfree_skb -a -g sleep 10 > invalid or unsupported event: 'skb:kfree_skb' > Run 'perf list' for a list of valid events > > usage: perf record [<options>] [<command>] > or: perf record [<options>] -- <command> [<options>] > > -e, --event <event> event selector. use 'perf list' to list > available events -g takes an optional argument. Reorder the arguments: perf record -e skb:kfree_skb -g -a sleep 10 ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-03 22:30 ` Hannes Frederic Sowa @ 2013-12-03 22:35 ` Or Gerlitz 2013-12-03 22:39 ` Hannes Frederic Sowa 0 siblings, 1 reply; 63+ messages in thread From: Or Gerlitz @ 2013-12-03 22:35 UTC (permalink / raw) To: Or Gerlitz, Eric Dumazet, David Miller, Joseph Gasparakis, Jerry Chu, Or Gerlitz, Eric Dumazet, Alexei Starovoitov, Pravin B Shelar, netdev@vger.kernel.org On Wed, Dec 4, 2013 at 12:30 AM, Hannes Frederic Sowa <hannes@stressinduktion.org> wrote: > On Wed, Dec 04, 2013 at 12:22:08AM +0200, Or Gerlitz wrote: >> On Tue, Dec 3, 2013 at 11:55 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: >> > On Tue, 2013-12-03 at 16:50 -0500, David Miller wrote: >> > >> >> At this point I would use drop monitor to figure out in what context >> >> packets are being dropped on the floor. There are scripts provided >> >> with the perf tool to utilize it. >> > >> > Most easy way is to do : >> > >> > perf record -e skb:kfree_skb -a -g sleep 10 >> >> some typo here? I tried the perf tool that comes with the net git >> >> ./perf record -e skb:kfree_skb -a -g sleep 10 >> invalid or unsupported event: 'skb:kfree_skb' >> Run 'perf list' for a list of valid events >> >> usage: perf record [<options>] [<command>] >> or: perf record [<options>] -- <command> [<options>] >> >> -e, --event <event> event selector. use 'perf list' to list >> available events > > -g takes an optional argument. 
> Reorder the arguments:
>
> perf record -e skb:kfree_skb -g -a sleep 10

Sorry, it doesn't help, I get the same error even with the perf from the kernel.org tree, which has these events (no sign of skb's):

List of pre-defined events (to be used in -e):

  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  cache-references                                   [Hardware event]
  cache-misses                                       [Hardware event]
  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]
  stalled-cycles-backend OR idle-cycles-backend      [Hardware event]
  ref-cycles                                         [Hardware event]

  cpu-clock                                          [Software event]
  task-clock                                         [Software event]
  page-faults OR faults                              [Software event]
  context-switches OR cs                             [Software event]
  cpu-migrations OR migrations                       [Software event]
  minor-faults                                       [Software event]
  major-faults                                       [Software event]
  alignment-faults                                   [Software event]
  emulation-faults                                   [Software event]

  L1-dcache-loads                                    [Hardware cache event]
  L1-dcache-load-misses                              [Hardware cache event]
  L1-dcache-stores                                   [Hardware cache event]
  L1-dcache-store-misses                             [Hardware cache event]
  L1-dcache-prefetches                               [Hardware cache event]
  L1-dcache-prefetch-misses                          [Hardware cache event]
  L1-icache-loads                                    [Hardware cache event]
  L1-icache-load-misses                              [Hardware cache event]
  L1-icache-prefetches                               [Hardware cache event]
  L1-icache-prefetch-misses                          [Hardware cache event]
  LLC-loads                                          [Hardware cache event]
  LLC-load-misses                                    [Hardware cache event]
  LLC-stores                                         [Hardware cache event]
  LLC-store-misses                                   [Hardware cache event]
  LLC-prefetches                                     [Hardware cache event]
  LLC-prefetch-misses                                [Hardware cache event]
  dTLB-loads                                         [Hardware cache event]
  dTLB-load-misses                                   [Hardware cache event]
  dTLB-stores                                        [Hardware cache event]
  dTLB-store-misses                                  [Hardware cache event]
  dTLB-prefetches                                    [Hardware cache event]

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-03 22:35 ` Or Gerlitz @ 2013-12-03 22:39 ` Hannes Frederic Sowa 0 siblings, 0 replies; 63+ messages in thread From: Hannes Frederic Sowa @ 2013-12-03 22:39 UTC (permalink / raw) To: Or Gerlitz Cc: Eric Dumazet, David Miller, Joseph Gasparakis, Jerry Chu, Or Gerlitz, Eric Dumazet, Alexei Starovoitov, Pravin B Shelar, netdev@vger.kernel.org On Wed, Dec 04, 2013 at 12:35:37AM +0200, Or Gerlitz wrote: > Sorry, it doesn't help, I get the same error even with the oerf > kernel.org tree which has these events (no sign for skb's) > > > List of pre-defined events (to be used in -e): > cpu-cycles OR cycles [Hardware event] > instructions [Hardware event] > cache-references [Hardware event] > cache-misses [Hardware event] > branch-instructions OR branches [Hardware event] > branch-misses [Hardware event] > bus-cycles [Hardware event] > stalled-cycles-frontend OR idle-cycles-frontend [Hardware event] > stalled-cycles-backend OR idle-cycles-backend [Hardware event] > ref-cycles [Hardware event] > > cpu-clock [Software event] > task-clock [Software event] > page-faults OR faults [Software event] > context-switches OR cs [Software event] > cpu-migrations OR migrations [Software event] > minor-faults [Software event] > major-faults [Software event] > alignment-faults [Software event] > emulation-faults [Software event] > > L1-dcache-loads [Hardware cache event] > L1-dcache-load-misses [Hardware cache event] > L1-dcache-stores [Hardware cache event] > L1-dcache-store-misses [Hardware cache event] > L1-dcache-prefetches [Hardware cache event] > L1-dcache-prefetch-misses [Hardware cache event] > L1-icache-loads [Hardware cache event] > L1-icache-load-misses [Hardware cache event] > L1-icache-prefetches [Hardware cache event] > L1-icache-prefetch-misses [Hardware cache event] > LLC-loads [Hardware cache event] > LLC-load-misses [Hardware cache event] > LLC-stores [Hardware cache event] > LLC-store-misses [Hardware 
cache event] > LLC-prefetches [Hardware cache event] > LLC-prefetch-misses [Hardware cache event] > dTLB-loads [Hardware cache event] > dTLB-load-misses [Hardware cache event] > dTLB-stores [Hardware cache event] > dTLB-store-misses [Hardware cache event] > dTLB-prefetches [Hardware cache event] Is this the whole output of perf list? Then you seem to be missing some tracepoint options, CONFIG_TRACEPOINTS e.g.? I can confirm it works for me on a current net build. Greetings, Hannes ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-03 21:55 ` Eric Dumazet 2013-12-03 22:15 ` Or Gerlitz 2013-12-03 22:22 ` Or Gerlitz @ 2013-12-03 23:10 ` Or Gerlitz 2013-12-03 23:30 ` Or Gerlitz 2013-12-03 23:59 ` Eric Dumazet 2 siblings, 2 replies; 63+ messages in thread From: Or Gerlitz @ 2013-12-03 23:10 UTC (permalink / raw) To: Eric Dumazet Cc: David Miller, Joseph Gasparakis, Jerry Chu, Or Gerlitz, Eric Dumazet, Alexei Starovoitov, Pravin B Shelar, netdev@vger.kernel.org On Tue, Dec 3, 2013 at 11:55 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > On Tue, 2013-12-03 at 16:50 -0500, David Miller wrote: > >> At this point I would use drop monitor to figure out in what context >> packets are being dropped on the floor. There are scripts provided >> with the perf tool to utilize it. > > Most easy way is to do : > > perf record -e skb:kfree_skb -a -g sleep 10 > perf report $ ./perf record -e skb:kfree_skb -g -a sleep 10 $ ./perf report -i perf.data Samples: 883K of event 'skb:kfree_skb', Event count (approx.): 883406 + 97.13% swapper [kernel.kallsyms] [k] net_tx_action + 1.53% iperf [kernel.kallsyms] [k] net_tx_action + 1.03% perf [kernel.kallsyms] [k] net_tx_action + 0.27% ksoftirqd/7 [kernel.kallsyms] [k] net_tx_action + 0.03% kworker/7:1 [kernel.kallsyms] [k] net_tx_action + 0.00% rpcbind [kernel.kallsyms] [k] net_tx_action + 0.00% swapper [kernel.kallsyms] [k] kfree_skb + 0.00% sleep [kernel.kallsyms] [k] net_tx_action + 0.00% hald-addon-acpi [kernel.kallsyms] [k] kfree_skb + 0.00% iperf [kernel.kallsyms] [k] kfree_skb + 0.00% perf [kernel.kallsyms] [k] kfree_skb > > > ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-03 23:10 ` Or Gerlitz @ 2013-12-03 23:30 ` Or Gerlitz 2013-12-03 23:49 ` Hannes Frederic Sowa 2013-12-03 23:59 ` Eric Dumazet 1 sibling, 1 reply; 63+ messages in thread
From: Or Gerlitz @ 2013-12-03 23:30 UTC (permalink / raw)
To: Eric Dumazet
Cc: David Miller, Joseph Gasparakis, Jerry Chu, Or Gerlitz, Eric Dumazet, Alexei Starovoitov, Pravin B Shelar, netdev@vger.kernel.org

On Wed, Dec 4, 2013 at 1:10 AM, Or Gerlitz <or.gerlitz@gmail.com> wrote:
> On Tue, Dec 3, 2013 at 11:55 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On Tue, 2013-12-03 at 16:50 -0500, David Miller wrote:
>>
>>> At this point I would use drop monitor to figure out in what context
>>> packets are being dropped on the floor. There are scripts provided
>>> with the perf tool to utilize it.
>>
>> Most easy way is to do :
>>
>> perf record -e skb:kfree_skb -a -g sleep 10
>> perf report
>
> $ ./perf record -e skb:kfree_skb -g -a sleep 10
> $ ./perf report -i perf.data
>
> Samples: 883K of event 'skb:kfree_skb', Event count (approx.): 883406
> +  97.13%  swapper          [kernel.kallsyms]  [k] net_tx_action
> +   1.53%  iperf            [kernel.kallsyms]  [k] net_tx_action
> +   1.03%  perf             [kernel.kallsyms]  [k] net_tx_action
> +   0.27%  ksoftirqd/7      [kernel.kallsyms]  [k] net_tx_action
> +   0.03%  kworker/7:1      [kernel.kallsyms]  [k] net_tx_action
> +   0.00%  rpcbind          [kernel.kallsyms]  [k] net_tx_action
> +   0.00%  swapper          [kernel.kallsyms]  [k] kfree_skb
> +   0.00%  sleep            [kernel.kallsyms]  [k] net_tx_action
> +   0.00%  hald-addon-acpi  [kernel.kallsyms]  [k] kfree_skb
> +   0.00%  iperf            [kernel.kallsyms]  [k] kfree_skb
> +   0.00%  perf             [kernel.kallsyms]  [k] kfree_skb

I added proper sorting (thanks Rick), here's the passive side report

-  99.99%  [kernel.kallsyms]  [k] net_tx_action
   - net_tx_action
      - 59.20% __libc_recv
           100.00% 0
      - 35.61% __write_nocancel
         - 100.00% run_builtin
              main
              __libc_start_main
      - 2.60% __poll
         - 92.62% run_builtin
              main
              __libc_start_main
         - 7.38% 0x7f2d4c4adfd4
              0x626370523a32333a
      - 1.86% cmd_record
           run_builtin
           main
           __libc_start_main
-   0.01%  [kernel.kallsyms]  [k] kfree_skb
   - kfree_skb
      - 50.00% __connect_nocancel
           0x64697063612f6e75
      - 33.33% __libc_recv
           0
      - 16.67% __write_nocancel
           run_builtin
           main
           __libc_start_main

and the active side report

  100.00%  [kernel.kallsyms]  [k] net_tx_action
   - net_tx_action
      - 76.91% __write_nocancel
         - 100.00% run_builtin
              main
              __libc_start_main
        15.92% 0x37cc60e4ed
      - 2.69% __poll
           run_builtin
           main
           __libc_start_main
      - 1.92% cmd_record
           run_builtin
           main
           __libc_start_main
      - 1.66% pthread_cond_signal@@GLIBC_2.3.2
           0x10000
      - 0.52% perf_header__has_feat
         - 73.28% run_builtin
              main
              __libc_start_main
         - 26.72% cmd_record
              run_builtin
              main
              __libc_start_main
-   0.00%  [kernel.kallsyms]  [k] kfree_skb
   - kfree_skb
      - 100.00% __GI___connect_internal
         - 50.00% get_mapping
              __nscd_get_map_ref
           50.00% __nscd_open_socket

^ permalink raw reply	[flat|nested] 63+ messages in thread
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-03 23:30 ` Or Gerlitz @ 2013-12-03 23:49 ` Hannes Frederic Sowa 0 siblings, 0 replies; 63+ messages in thread From: Hannes Frederic Sowa @ 2013-12-03 23:49 UTC (permalink / raw) To: Or Gerlitz Cc: Eric Dumazet, David Miller, Joseph Gasparakis, Jerry Chu, Or Gerlitz, Eric Dumazet, Alexei Starovoitov, Pravin B Shelar, netdev@vger.kernel.org On Wed, Dec 04, 2013 at 01:30:09AM +0200, Or Gerlitz wrote: > On Wed, Dec 4, 2013 at 1:10 AM, Or Gerlitz <or.gerlitz@gmail.com> wrote: > > On Tue, Dec 3, 2013 at 11:55 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > >> On Tue, 2013-12-03 at 16:50 -0500, David Miller wrote: > >> > >>> At this point I would use drop monitor to figure out in what context > >>> packets are being dropped on the floor. There are scripts provided > >>> with the perf tool to utilize it. > >> > >> Most easy way is to do : > >> > >> perf record -e skb:kfree_skb -a -g sleep 10 > >> perf report > > > > $ ./perf record -e skb:kfree_skb -g -a sleep 10 > > $ ./perf report -i perf.data > > > > > > Samples: 883K of event 'skb:kfree_skb', Event count (approx.): 883406 > > + 97.13% swapper [kernel.kallsyms] [k] net_tx_action > > + 1.53% iperf [kernel.kallsyms] [k] net_tx_action > > + 1.03% perf [kernel.kallsyms] [k] net_tx_action > > + 0.27% ksoftirqd/7 [kernel.kallsyms] [k] net_tx_action > > + 0.03% kworker/7:1 [kernel.kallsyms] [k] net_tx_action > > + 0.00% rpcbind [kernel.kallsyms] [k] net_tx_action > > + 0.00% swapper [kernel.kallsyms] [k] kfree_skb > > + 0.00% sleep [kernel.kallsyms] [k] net_tx_action > > + 0.00% hald-addon-acpi [kernel.kallsyms] [k] kfree_skb > > + 0.00% iperf [kernel.kallsyms] [k] kfree_skb > > + 0.00% perf [kernel.kallsyms] [k] kfree_skb > > I added proper sorting (thanks Rick), here's the passive side report Btw. the nice helper for the dropwatch is "perf script net_dropmonitor". Greetings, Hannes ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-03 23:10 ` Or Gerlitz 2013-12-03 23:30 ` Or Gerlitz @ 2013-12-03 23:59 ` Eric Dumazet 2013-12-04 0:26 ` Alexei Starovoitov ` (2 more replies) 1 sibling, 3 replies; 63+ messages in thread From: Eric Dumazet @ 2013-12-03 23:59 UTC (permalink / raw) To: Or Gerlitz Cc: David Miller, Joseph Gasparakis, Jerry Chu, Or Gerlitz, Eric Dumazet, Alexei Starovoitov, Pravin B Shelar, netdev@vger.kernel.org On Wed, 2013-12-04 at 01:10 +0200, Or Gerlitz wrote: > Samples: 883K of event 'skb:kfree_skb', Event count (approx.): 883406 > + 97.13% swapper [kernel.kallsyms] [k] net_tx_action > + 1.53% iperf [kernel.kallsyms] [k] net_tx_action > + 1.03% perf [kernel.kallsyms] [k] net_tx_action > + 0.27% ksoftirqd/7 [kernel.kallsyms] [k] net_tx_action > + 0.03% kworker/7:1 [kernel.kallsyms] [k] net_tx_action > + 0.00% rpcbind [kernel.kallsyms] [k] net_tx_action > + 0.00% swapper [kernel.kallsyms] [k] kfree_skb > + 0.00% sleep [kernel.kallsyms] [k] net_tx_action > + 0.00% hald-addon-acpi [kernel.kallsyms] [k] kfree_skb > + 0.00% iperf [kernel.kallsyms] [k] kfree_skb > + 0.00% perf [kernel.kallsyms] [k] kfree_skb > Right, I actually have a patch for that, but was waiting for net-next being re-opened : commit 9a731d750dd8bf0b8c20fb1ca53c42317fb4dd37 Author: Eric Dumazet <edumazet@google.com> Date: Mon Nov 25 13:09:20 2013 -0800 net-fixes: introduce dev_consume_skb_any() Some NIC drivers use dev_kfree_skb_any() generic helper to free skbs, both for dropped packets and TX completed ones. To have "perf record -e skb:kfree_skb" give good hints on where packets are dropped (if any), we need to separate the two causes. dev_consume_skb_any() is a helper to free skbs that were properly sent to the wire. 
Signed-off-by: Eric Dumazet <edumazet@google.com> Google-Bug-Id: 11634401 --- diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c index ec96130533cc..8b8c2171b187 100644 --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c @@ -209,9 +209,9 @@ static u16 bnx2x_free_tx_pkt(struct bnx2x *bp, struct bnx2x_fp_txdata *txdata, if (likely(skb)) { (*pkts_compl)++; (*bytes_compl) += skb->len; + dev_consume_skb_any(skb); } - dev_kfree_skb_any(skb); tx_buf->first_bd = 0; tx_buf->skb = NULL; diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c index 8d3945ab7334..03081932e519 100644 --- a/drivers/net/ethernet/intel/e1000e/netdev.c +++ b/drivers/net/ethernet/intel/e1000e/netdev.c @@ -1058,7 +1058,7 @@ static void e1000_put_txbuf(struct e1000_ring *tx_ring, buffer_info->dma = 0; } if (buffer_info->skb) { - dev_kfree_skb_any(buffer_info->skb); + dev_consume_skb_any(buffer_info->skb); buffer_info->skb = NULL; } buffer_info->time_stamp = 0; diff --git a/drivers/net/ethernet/marvell/sky2.c b/drivers/net/ethernet/marvell/sky2.c index 43aa7acd84a6..294825efb248 100644 --- a/drivers/net/ethernet/marvell/sky2.c +++ b/drivers/net/ethernet/marvell/sky2.c @@ -2037,7 +2037,7 @@ static void sky2_tx_complete(struct sky2_port *sky2, u16 done) bytes_compl += skb->len; re->skb = NULL; - dev_kfree_skb_any(skb); + dev_consume_skb_any(skb); sky2->tx_next = RING_NEXT(idx, sky2->tx_ring_size); } diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c index f54ebd5a1702..653484bfae98 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c @@ -317,7 +317,7 @@ static u32 mlx4_en_free_tx_desc(struct mlx4_en_priv *priv, } } } - dev_kfree_skb_any(skb); + dev_consume_skb_any(skb); return tx_info->nr_txbb; } diff --git 
a/drivers/net/ethernet/nvidia/forcedeth.c b/drivers/net/ethernet/nvidia/forcedeth.c index 2d045be4b5cf..d44f7b69a6a0 100644 --- a/drivers/net/ethernet/nvidia/forcedeth.c +++ b/drivers/net/ethernet/nvidia/forcedeth.c @@ -2557,7 +2557,7 @@ static int nv_tx_done(struct net_device *dev, int limit) u64_stats_update_end(&np->swstats_tx_syncp); } bytes_compl += np->get_tx_ctx->skb->len; - dev_kfree_skb_any(np->get_tx_ctx->skb); + dev_consume_skb_any(np->get_tx_ctx->skb); np->get_tx_ctx->skb = NULL; tx_work++; } @@ -2574,7 +2574,7 @@ static int nv_tx_done(struct net_device *dev, int limit) u64_stats_update_end(&np->swstats_tx_syncp); } bytes_compl += np->get_tx_ctx->skb->len; - dev_kfree_skb_any(np->get_tx_ctx->skb); + dev_consume_skb_any(np->get_tx_ctx->skb); np->get_tx_ctx->skb = NULL; tx_work++; } @@ -2625,7 +2625,7 @@ static int nv_tx_done_optimized(struct net_device *dev, int limit) } bytes_cleaned += np->get_tx_ctx->skb->len; - dev_kfree_skb_any(np->get_tx_ctx->skb); + dev_consume_skb_any(np->get_tx_ctx->skb); np->get_tx_ctx->skb = NULL; tx_work++; diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 916241d16c67..ab8970693ff3 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -755,7 +755,7 @@ static void free_old_xmit_skbs(struct send_queue *sq) stats->tx_packets++; u64_stats_update_end(&stats->tx_syncp); - dev_kfree_skb_any(skb); + dev_consume_skb_any(skb); } } diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 7f0ed423a360..8a7482fa2656 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -2380,6 +2380,13 @@ void dev_kfree_skb_irq(struct sk_buff *skb); */ void dev_kfree_skb_any(struct sk_buff *skb); +#define SKB_CONSUMED_MAGIC ((void *)0xDEAD0001) +static inline void dev_consume_skb_any(struct sk_buff *skb) +{ + skb->dev = SKB_CONSUMED_MAGIC; + dev_kfree_skb_any(skb); +} + int netif_rx(struct sk_buff *skb); int netif_rx_ni(struct sk_buff *skb); int netif_receive_skb(struct sk_buff 
*skb);

diff --git a/net/core/dev.c b/net/core/dev.c
index ba3b7ea5ebb3..b2b0e5776ce9 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3306,7 +3306,10 @@ static void net_tx_action(struct softirq_action *h)
 			clist = clist->next;
 
 			WARN_ON(atomic_read(&skb->users));
-			trace_kfree_skb(skb, net_tx_action);
+			if (likely(skb->dev == SKB_CONSUMED_MAGIC))
+				trace_consume_skb(skb);
+			else
+				trace_kfree_skb(skb, net_tx_action);
 			__kfree_skb(skb);
 		}
 	}

^ permalink raw reply related	[flat|nested] 63+ messages in thread
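The core trick in Eric's patch above is a sentinel pointer: TX-completion paths stamp skb->dev with SKB_CONSUMED_MAGIC before freeing, so the common deferred-free path can tell "sent fine" from "dropped" and fire the matching tracepoint. A minimal userspace sketch of that pattern follows; struct pkt, the counters, and the pkt_* names are hypothetical stand-ins, not kernel code:

```c
#include <assert.h>

/* Hypothetical stand-in for struct sk_buff; only the one field the
 * sketch needs. In the real patch the marker goes into skb->dev. */
struct pkt {
	void *dev;              /* overloaded as a marker on free */
};

#define PKT_CONSUMED_MAGIC ((void *)0xDEAD0001)

static int consumed_count, dropped_count;

/* Common free path, like net_tx_action(): branch on the marker so a
 * tracer can distinguish "sent OK" from "dropped". */
static void pkt_free(struct pkt *p)
{
	if (p->dev == PKT_CONSUMED_MAGIC)
		consumed_count++;   /* trace_consume_skb() in the patch */
	else
		dropped_count++;    /* trace_kfree_skb() in the patch */
}

/* dev_consume_skb_any() analogue: mark as consumed, then free. */
static void pkt_consume(struct pkt *p)
{
	p->dev = PKT_CONSUMED_MAGIC;
	pkt_free(p);
}
```

With this split, a "perf record -e skb:kfree_skb" style probe on the drop branch stops being flooded by normal TX-completion frees.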
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-03 23:59 ` Eric Dumazet @ 2013-12-04 0:26 ` Alexei Starovoitov 2013-12-04 0:36 ` Eric Dumazet 2013-12-04 6:39 ` David Miller 2013-12-05 12:45 ` [PATCH net-next] net: introduce dev_consume_skb_any() Eric Dumazet 2 siblings, 1 reply; 63+ messages in thread From: Alexei Starovoitov @ 2013-12-04 0:26 UTC (permalink / raw) To: Eric Dumazet Cc: Or Gerlitz, David Miller, Joseph Gasparakis, Jerry Chu, Or Gerlitz, Eric Dumazet, Pravin B Shelar, netdev@vger.kernel.org On Tue, Dec 3, 2013 at 3:59 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > > +#define SKB_CONSUMED_MAGIC ((void *)0xDEAD0001) > +static inline void dev_consume_skb_any(struct sk_buff *skb) > +{ > + skb->dev = SKB_CONSUMED_MAGIC; > + dev_kfree_skb_any(skb); > +} > + > int netif_rx(struct sk_buff *skb); > int netif_rx_ni(struct sk_buff *skb); > int netif_receive_skb(struct sk_buff *skb); > diff --git a/net/core/dev.c b/net/core/dev.c > index ba3b7ea5ebb3..b2b0e5776ce9 100644 > --- a/net/core/dev.c > +++ b/net/core/dev.c > @@ -3306,7 +3306,10 @@ static void net_tx_action(struct softirq_action *h) > clist = clist->next; > > WARN_ON(atomic_read(&skb->users)); > - trace_kfree_skb(skb, net_tx_action); > + if (likely(skb->dev == SKB_CONSUMED_MAGIC)) > + trace_consume_skb(skb); > + else > + trace_kfree_skb(skb, net_tx_action); Could you use some other way to mark skb ? In tracing we might want to examine skb more carefully and not being able to see the device will limit the usability of this tracepoint. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-04 0:26 ` Alexei Starovoitov @ 2013-12-04 0:36 ` Eric Dumazet 2013-12-04 0:55 ` Alexei Starovoitov 0 siblings, 1 reply; 63+ messages in thread From: Eric Dumazet @ 2013-12-04 0:36 UTC (permalink / raw) To: Alexei Starovoitov Cc: Or Gerlitz, David Miller, Joseph Gasparakis, Jerry Chu, Or Gerlitz, Eric Dumazet, Pravin B Shelar, netdev@vger.kernel.org On Tue, 2013-12-03 at 16:26 -0800, Alexei Starovoitov wrote: > > Could you use some other way to mark skb ? I could ;) > In tracing we might want to examine skb more carefully and not being > able to see the device > will limit the usability of this tracepoint. Unfortunately, using skb->dev as a pointer to device would be buggy or expensive (you would need to take a reference on device in order not letting it disappear, as we escape RCU protection) Current TRACE_EVENT for trace_consume_skb() does not use skb->dev. Anyway, this magic is pretty easy to change, I am open to suggestions. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-04 0:36 ` Eric Dumazet @ 2013-12-04 0:55 ` Alexei Starovoitov 2013-12-04 1:23 ` Eric Dumazet 0 siblings, 1 reply; 63+ messages in thread From: Alexei Starovoitov @ 2013-12-04 0:55 UTC (permalink / raw) To: Eric Dumazet Cc: Or Gerlitz, David Miller, Joseph Gasparakis, Jerry Chu, Or Gerlitz, Eric Dumazet, Pravin B Shelar, netdev@vger.kernel.org On Tue, Dec 3, 2013 at 4:36 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > On Tue, 2013-12-03 at 16:26 -0800, Alexei Starovoitov wrote: > >> >> Could you use some other way to mark skb ? > > I could ;) > >> In tracing we might want to examine skb more carefully and not being >> able to see the device >> will limit the usability of this tracepoint. > > Unfortunately, using skb->dev as a pointer to device would be buggy or > expensive (you would need to take a reference on device in order not > letting it disappear, as we escape RCU protection) well, yes, you might have an skb around when device is already freed when skb_dst_noref. but I'm not suggesting anything expensive. Tracing definitely should not add overhead by doing rcu_lock() or dev_hold(). Instead it can go through skb, skb->dev, skb->dev->xxx via probe_kernel_read(). If dev is gone, it's still safe. > Anyway, this magic is pretty easy to change, I am open to suggestions. you're the expert :) use skb->mark field, since it's unused during freeing path... ? ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-04 0:55 ` Alexei Starovoitov @ 2013-12-04 1:23 ` Eric Dumazet 2013-12-04 1:59 ` Alexei Starovoitov 2013-12-06 9:06 ` Or Gerlitz 0 siblings, 2 replies; 63+ messages in thread From: Eric Dumazet @ 2013-12-04 1:23 UTC (permalink / raw) To: Alexei Starovoitov Cc: Or Gerlitz, David Miller, Joseph Gasparakis, Jerry Chu, Or Gerlitz, Eric Dumazet, Pravin B Shelar, netdev@vger.kernel.org On Tue, 2013-12-03 at 16:55 -0800, Alexei Starovoitov wrote: > On Tue, Dec 3, 2013 at 4:36 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > > On Tue, 2013-12-03 at 16:26 -0800, Alexei Starovoitov wrote: > > > >> > >> Could you use some other way to mark skb ? > > > > I could ;) > > > >> In tracing we might want to examine skb more carefully and not being > >> able to see the device > >> will limit the usability of this tracepoint. > > > > Unfortunately, using skb->dev as a pointer to device would be buggy or > > expensive (you would need to take a reference on device in order not > > letting it disappear, as we escape RCU protection) > > well, yes, you might have an skb around when device is already freed > when skb_dst_noref. > but I'm not suggesting anything expensive. Tracing definitely should > not add overhead by doing rcu_lock() or dev_hold(). Instead it can go > through skb, skb->dev, skb->dev->xxx via probe_kernel_read(). If dev > is gone, it's still safe. Its certainly not safe to 'probe'. Its not about faulting inexistent memory, that is the least of the problem. Any kind of information fetched from skb->dev might have been overwritten. You could for example fetch security sensitive data and expose it. > > > Anyway, this magic is pretty easy to change, I am open to suggestions. > > you're the expert :) use skb->mark field, since it's unused during > freeing path... ? cache line miss ;) skb->dev is in the first cache line, where we access skb->next anyway. 
I could use skb->cb[] like the following patch : diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c index ec96130533cc..8b8c2171b187 100644 --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c @@ -209,9 +209,9 @@ static u16 bnx2x_free_tx_pkt(struct bnx2x *bp, struct bnx2x_fp_txdata *txdata, if (likely(skb)) { (*pkts_compl)++; (*bytes_compl) += skb->len; + dev_consume_skb_any(skb); } - dev_kfree_skb_any(skb); tx_buf->first_bd = 0; tx_buf->skb = NULL; diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c index 8d3945ab7334..03081932e519 100644 --- a/drivers/net/ethernet/intel/e1000e/netdev.c +++ b/drivers/net/ethernet/intel/e1000e/netdev.c @@ -1058,7 +1058,7 @@ static void e1000_put_txbuf(struct e1000_ring *tx_ring, buffer_info->dma = 0; } if (buffer_info->skb) { - dev_kfree_skb_any(buffer_info->skb); + dev_consume_skb_any(buffer_info->skb); buffer_info->skb = NULL; } buffer_info->time_stamp = 0; diff --git a/drivers/net/ethernet/marvell/sky2.c b/drivers/net/ethernet/marvell/sky2.c index 43aa7acd84a6..294825efb248 100644 --- a/drivers/net/ethernet/marvell/sky2.c +++ b/drivers/net/ethernet/marvell/sky2.c @@ -2037,7 +2037,7 @@ static void sky2_tx_complete(struct sky2_port *sky2, u16 done) bytes_compl += skb->len; re->skb = NULL; - dev_kfree_skb_any(skb); + dev_consume_skb_any(skb); sky2->tx_next = RING_NEXT(idx, sky2->tx_ring_size); } diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c index f54ebd5a1702..653484bfae98 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c @@ -317,7 +317,7 @@ static u32 mlx4_en_free_tx_desc(struct mlx4_en_priv *priv, } } } - dev_kfree_skb_any(skb); + dev_consume_skb_any(skb); return tx_info->nr_txbb; } diff --git a/drivers/net/ethernet/nvidia/forcedeth.c 
b/drivers/net/ethernet/nvidia/forcedeth.c index 2d045be4b5cf..d44f7b69a6a0 100644 --- a/drivers/net/ethernet/nvidia/forcedeth.c +++ b/drivers/net/ethernet/nvidia/forcedeth.c @@ -2557,7 +2557,7 @@ static int nv_tx_done(struct net_device *dev, int limit) u64_stats_update_end(&np->swstats_tx_syncp); } bytes_compl += np->get_tx_ctx->skb->len; - dev_kfree_skb_any(np->get_tx_ctx->skb); + dev_consume_skb_any(np->get_tx_ctx->skb); np->get_tx_ctx->skb = NULL; tx_work++; } @@ -2574,7 +2574,7 @@ static int nv_tx_done(struct net_device *dev, int limit) u64_stats_update_end(&np->swstats_tx_syncp); } bytes_compl += np->get_tx_ctx->skb->len; - dev_kfree_skb_any(np->get_tx_ctx->skb); + dev_consume_skb_any(np->get_tx_ctx->skb); np->get_tx_ctx->skb = NULL; tx_work++; } @@ -2625,7 +2625,7 @@ static int nv_tx_done_optimized(struct net_device *dev, int limit) } bytes_cleaned += np->get_tx_ctx->skb->len; - dev_kfree_skb_any(np->get_tx_ctx->skb); + dev_consume_skb_any(np->get_tx_ctx->skb); np->get_tx_ctx->skb = NULL; tx_work++; diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 916241d16c67..ab8970693ff3 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -755,7 +755,7 @@ static void free_old_xmit_skbs(struct send_queue *sq) stats->tx_packets++; u64_stats_update_end(&stats->tx_syncp); - dev_kfree_skb_any(skb); + dev_consume_skb_any(skb); } } diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 7f0ed423a360..8b80a58ec1ac 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -2374,11 +2374,38 @@ int netif_get_num_default_rss_queues(void); */ void dev_kfree_skb_irq(struct sk_buff *skb); +void __dev_kfree_skb_any(struct sk_buff *skb); + +struct __dev_kfree_skb_cb { + unsigned int reason; +}; + +static inline struct __dev_kfree_skb_cb *get_kfree_skb_cb(const struct sk_buff *skb) +{ + return (struct __dev_kfree_skb_cb *)skb->cb; +} + +enum { + SKB_REASON_CONSUMED, + SKB_REASON_DROPPED, +}; + /* Use this variant in 
places where it could be invoked * from either hardware interrupt or other context, with hardware interrupts * either disabled or enabled. + * Note that TX completion should use dev_consume_skb_any() */ -void dev_kfree_skb_any(struct sk_buff *skb); +static inline void dev_kfree_skb_any(struct sk_buff *skb) +{ + get_kfree_skb_cb(skb)->reason = SKB_REASON_DROPPED; + __dev_kfree_skb_any(skb); +} + +static inline void dev_consume_skb_any(struct sk_buff *skb) +{ + get_kfree_skb_cb(skb)->reason = SKB_REASON_CONSUMED; + __dev_kfree_skb_any(skb); +} int netif_rx(struct sk_buff *skb); int netif_rx_ni(struct sk_buff *skb); diff --git a/net/core/dev.c b/net/core/dev.c index ba3b7ea5ebb3..3170776e53da 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -2161,14 +2161,14 @@ void dev_kfree_skb_irq(struct sk_buff *skb) } EXPORT_SYMBOL(dev_kfree_skb_irq); -void dev_kfree_skb_any(struct sk_buff *skb) +void __dev_kfree_skb_any(struct sk_buff *skb) { if (in_irq() || irqs_disabled()) dev_kfree_skb_irq(skb); else dev_kfree_skb(skb); } -EXPORT_SYMBOL(dev_kfree_skb_any); +EXPORT_SYMBOL(__dev_kfree_skb_any); /** @@ -3306,7 +3306,10 @@ static void net_tx_action(struct softirq_action *h) clist = clist->next; WARN_ON(atomic_read(&skb->users)); - trace_kfree_skb(skb, net_tx_action); + if (likely(get_kfree_skb_cb(skb)->reason == SKB_REASON_CONSUMED)) + trace_consume_skb(skb); + else + trace_kfree_skb(skb, net_tx_action); __kfree_skb(skb); } } ^ permalink raw reply related [flat|nested] 63+ messages in thread
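Eric's cache-line argument above (skb->dev sits in the first cache line, which is already loaded to read skb->next, whereas skb->mark would cost an extra miss) can be checked with offsetof() on a stand-in layout. The struct below only mimics the field ordering of struct sk_buff; the field names echo the real ones but the offsets are illustrative, not the kernel's actual layout:

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64           /* typical x86 cache line size */

/* Simplified stand-in echoing struct sk_buff's ordering: next/prev
 * and dev at the front, cb[] next, mark much later. */
struct fake_skb {
	struct fake_skb *next;
	struct fake_skb *prev;
	void *sk;
	void *dev;              /* same cache line as ->next */
	char cb[48];            /* control buffer scratch space */
	unsigned long pad[8];   /* stands in for the many mid-struct fields */
	unsigned int mark;      /* would need another cache line */
};
```

This is why marking the skb through either skb->dev or skb->cb[] is essentially free on the free path, while skb->mark is not.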
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-04 1:23 ` Eric Dumazet @ 2013-12-04 1:59 ` Alexei Starovoitov 2013-12-06 9:06 ` Or Gerlitz 1 sibling, 0 replies; 63+ messages in thread From: Alexei Starovoitov @ 2013-12-04 1:59 UTC (permalink / raw) To: Eric Dumazet Cc: Or Gerlitz, David Miller, Joseph Gasparakis, Jerry Chu, Or Gerlitz, Eric Dumazet, Pravin B Shelar, netdev@vger.kernel.org On Tue, Dec 3, 2013 at 5:23 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > On Tue, 2013-12-03 at 16:55 -0800, Alexei Starovoitov wrote: >> On Tue, Dec 3, 2013 at 4:36 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: >> > On Tue, 2013-12-03 at 16:26 -0800, Alexei Starovoitov wrote: >> > >> >> >> >> Could you use some other way to mark skb ? >> > >> > I could ;) >> > >> >> In tracing we might want to examine skb more carefully and not being >> >> able to see the device >> >> will limit the usability of this tracepoint. >> > >> > Unfortunately, using skb->dev as a pointer to device would be buggy or >> > expensive (you would need to take a reference on device in order not >> > letting it disappear, as we escape RCU protection) >> >> well, yes, you might have an skb around when device is already freed >> when skb_dst_noref. >> but I'm not suggesting anything expensive. Tracing definitely should >> not add overhead by doing rcu_lock() or dev_hold(). Instead it can go >> through skb, skb->dev, skb->dev->xxx via probe_kernel_read(). If dev >> is gone, it's still safe. > > Its certainly not safe to 'probe'. > > Its not about faulting inexistent memory, that is the least of the > problem. > > Any kind of information fetched from skb->dev might have been > overwritten. > > You could for example fetch security sensitive data and expose it. Of course. Even without walking pointer chains with probe() you could infer all sorts of info from tracepoints. That's why tracing filters are for root only. 
>> > Anyway, this magic is pretty easy to change, I am open to suggestions. >> >> you're the expert :) use skb->mark field, since it's unused during >> freeing path... ? > > cache line miss ;) > > skb->dev is in the first cache line, where we access skb->next anyway. > > I could use skb->cb[] like the following patch : > > +struct __dev_kfree_skb_cb { > + unsigned int reason; > +}; > + > +static inline struct __dev_kfree_skb_cb *get_kfree_skb_cb(const struct sk_buff *skb) > +{ > + return (struct __dev_kfree_skb_cb *)skb->cb; > +} > + > +enum { > + SKB_REASON_CONSUMED, > + SKB_REASON_DROPPED, > +}; > + > /* Use this variant in places where it could be invoked > * from either hardware interrupt or other context, with hardware interrupts > * either disabled or enabled. > + * Note that TX completion should use dev_consume_skb_any() > */ > -void dev_kfree_skb_any(struct sk_buff *skb); > +static inline void dev_kfree_skb_any(struct sk_buff *skb) > +{ > + get_kfree_skb_cb(skb)->reason = SKB_REASON_DROPPED; > + __dev_kfree_skb_any(skb); > +} > + > +static inline void dev_consume_skb_any(struct sk_buff *skb) > +{ > + get_kfree_skb_cb(skb)->reason = SKB_REASON_CONSUMED; > + __dev_kfree_skb_any(skb); > +} thanks. I think that is much cleaner. Ack. ^ permalink raw reply [flat|nested] 63+ messages in thread
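The skb->cb[] variant acked here can be exercised as a self-contained userspace sketch. The fake_* names are hypothetical, but the cast-the-scratch-buffer-and-stash-a-reason pattern mirrors the helpers in the patch:

```c
#include <assert.h>

/* Userspace stand-in: cb[] plays the role of sk_buff's 48-byte
 * control-buffer scratch space. */
struct fake_skb {
	char cb[48];
};

struct dev_kfree_skb_cb {
	unsigned int reason;
};

enum {
	SKB_REASON_CONSUMED,
	SKB_REASON_DROPPED,
};

static struct dev_kfree_skb_cb *get_kfree_skb_cb(struct fake_skb *skb)
{
	return (struct dev_kfree_skb_cb *)skb->cb;
}

static int consumed, dropped;

/* net_tx_action() analogue: one free path, two trace outcomes. */
static void fake_free(struct fake_skb *skb)
{
	if (get_kfree_skb_cb(skb)->reason == SKB_REASON_CONSUMED)
		consumed++;     /* trace_consume_skb() */
	else
		dropped++;      /* trace_kfree_skb() */
}

/* dev_consume_skb_any() / dev_kfree_skb_any() analogues. */
static void fake_consume_any(struct fake_skb *skb)
{
	get_kfree_skb_cb(skb)->reason = SKB_REASON_CONSUMED;
	fake_free(skb);
}

static void fake_kfree_any(struct fake_skb *skb)
{
	get_kfree_skb_cb(skb)->reason = SKB_REASON_DROPPED;
	fake_free(skb);
}
```

Unlike the magic-pointer version, this keeps skb->dev untouched, which is what makes the tracepoint more useful for inspection.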
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-04 1:23 ` Eric Dumazet 2013-12-04 1:59 ` Alexei Starovoitov @ 2013-12-06 9:06 ` Or Gerlitz 2013-12-06 13:36 ` Eric Dumazet 1 sibling, 1 reply; 63+ messages in thread
From: Or Gerlitz @ 2013-12-06 9:06 UTC (permalink / raw)
To: Eric Dumazet
Cc: Alexei Starovoitov, David Miller, Joseph Gasparakis, Jerry Chu, Or Gerlitz, Eric Dumazet, Pravin B Shelar, netdev@vger.kernel.org

On Wed, Dec 4, 2013 at 3:23 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> skb->dev is in the first cache line, where we access skb->next anyway.
> I could use skb->cb[] like the following patch :

Hi Eric, I applied on the net tree the patch you posted yesterday, "net: introduce dev_consume_skb_any()", along with the network drivers part of this patch; unless I got it wrong, I assume both pieces are needed?

So I re-ran the vxlan/veth test that we suspect goes through packet drops on TX.

With the patches applied I have almost no samples of that event

$ ./perf report -i perf.data
Samples: 89 of event 'skb:kfree_skb', Event count (approx.): 89
+  39.33%  ksoftirqd/2  [kernel.kallsyms]  [k] net_tx_action
+  28.09%  swapper      [kernel.kallsyms]  [k] net_tx_action
+  28.09%  sshd         [kernel.kallsyms]  [k] net_tx_action
+   2.25%  swapper      [kernel.kallsyms]  [k] kfree_skb
+   1.12%  kworker/2:2  [kernel.kallsyms]  [k] net_tx_action
+   1.12%  iperf        [kernel.kallsyms]  [k] net_tx_action

$ ./perf report -i perf.data --sort dso,symbol
Samples: 89 of event 'skb:kfree_skb', Event count (approx.): 89
+  97.75%  [kernel.kallsyms]  [k] net_tx_action
+   2.25%  [kernel.kallsyms]  [k] kfree_skb

^ permalink raw reply	[flat|nested] 63+ messages in thread
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-06 9:06 ` Or Gerlitz @ 2013-12-06 13:36 ` Eric Dumazet 2013-12-07 21:20 ` Or Gerlitz 2013-12-08 12:09 ` Or Gerlitz 0 siblings, 2 replies; 63+ messages in thread
From: Eric Dumazet @ 2013-12-06 13:36 UTC (permalink / raw)
To: Or Gerlitz
Cc: Alexei Starovoitov, David Miller, Joseph Gasparakis, Jerry Chu, Or Gerlitz, Eric Dumazet, Pravin B Shelar, netdev@vger.kernel.org

On Fri, 2013-12-06 at 11:06 +0200, Or Gerlitz wrote:
> On Wed, Dec 4, 2013 at 3:23 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > skb->dev is in the first cache line, where we access skb->next anyway.
> > I could use skb->cb[] like the following patch :
>
> Hi Eric, I applied on the net tree the patch you posted yesterday
> "net: introduce dev_consume_skb_any()" along with the network drivers
> part of this patch, unless I got it wrong, I assume both pieces are
> needed?
>
> So I re-run the vxlan/veth test that we suspect goes through packet drops on TX.
>
> With the patches applied I have almost no samples of that event
>
> $ ./perf report -i perf.data

How did you get this perf.data file ? There are a few drops.

> Samples: 89 of event 'skb:kfree_skb', Event count (approx.): 89
> +  39.33%  ksoftirqd/2  [kernel.kallsyms]  [k] net_tx_action
> +  28.09%  swapper      [kernel.kallsyms]  [k] net_tx_action
> +  28.09%  sshd         [kernel.kallsyms]  [k] net_tx_action
> +   2.25%  swapper      [kernel.kallsyms]  [k] kfree_skb
> +   1.12%  kworker/2:2  [kernel.kallsyms]  [k] net_tx_action
> +   1.12%  iperf        [kernel.kallsyms]  [k] net_tx_action
>
> ./perf report -i perf.data --sort dso,symbol
> Samples: 89 of event 'skb:kfree_skb', Event count (approx.): 89
> +  97.75%  [kernel.kallsyms]  [k] net_tx_action
> +   2.25%  [kernel.kallsyms]  [k] kfree_skb

OK, this means your driver drops a few packets in its ndo_start_xmit() handler.

Could you give us "ifconfig -a" reports as I already asked ?

You could temporarily change the dev_kfree_skb_any() in mlx4_en_xmit() to call kfree_skb(skb) instead, to get a stack trace (perf record -a -g -e skb:kfree_skb sleep 20 ; perf report)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index f54ebd5a1702..53130f27dec0 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -873,7 +873,7 @@ tx_drop_unmap:
 	}
 
 tx_drop:
-	dev_kfree_skb_any(skb);
+	kfree_skb(skb);
 	priv->stats.tx_dropped++;
 	return NETDEV_TX_OK;
 }

^ permalink raw reply related	[flat|nested] 63+ messages in thread
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-06 13:36 ` Eric Dumazet @ 2013-12-07 21:20 ` Or Gerlitz 2013-12-08 12:09 ` Or Gerlitz 1 sibling, 0 replies; 63+ messages in thread From: Or Gerlitz @ 2013-12-07 21:20 UTC (permalink / raw) To: Eric Dumazet Cc: Alexei Starovoitov, David Miller, Joseph Gasparakis, Jerry Chu, Or Gerlitz, Eric Dumazet, Pravin B Shelar, netdev@vger.kernel.org On Fri, Dec 6, 2013 at 3:36 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > On Fri, 2013-12-06 at 11:06 +0200, Or Gerlitz wrote: >> On Wed, Dec 4, 2013 at 3:23 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote: >> > skb->dev is in the first cache line, where we access skb->next anyway. >> > I could use skb->cb[] like the following patch : >> >> Hi Eric, I applied on the net tree the patch you posted yesterday >> "net: introduce dev_consume_skb_any()" along with the network drivers >> part of this patch, unless I got it wrong, I assume both pieces are >> needed? >> >> So I re-run the vxlan/veth test that we suspect goes through packet drops on TX. >> >> With the patches applied I have almost no samples of that event >> >> $ ./perf report -i perf.data > > How did you get this perf.data file ? There are a few drops. 
Using the command you suggested on the active side while running traffic (iperf tcp)

$ ./perf record -e skb:kfree_skb -g -a sleep 10

>> Samples: 89 of event 'skb:kfree_skb', Event count (approx.): 89
>> +  39.33%  ksoftirqd/2  [kernel.kallsyms]  [k] net_tx_action
>> +  28.09%  swapper      [kernel.kallsyms]  [k] net_tx_action
>> +  28.09%  sshd         [kernel.kallsyms]  [k] net_tx_action
>> +   2.25%  swapper      [kernel.kallsyms]  [k] kfree_skb
>> +   1.12%  kworker/2:2  [kernel.kallsyms]  [k] net_tx_action
>> +   1.12%  iperf        [kernel.kallsyms]  [k] net_tx_action
>>
>> ./perf report -i perf.data --sort dso,symbol
>> Samples: 89 of event 'skb:kfree_skb', Event count (approx.): 89
>> +  97.75%  [kernel.kallsyms]  [k] net_tx_action
>> +   2.25%  [kernel.kallsyms]  [k] kfree_skb
>> --

> OK, this means your driver drops few packets in its ndo_start_xmit()
> handler.
>
> Could you give us "ifconfig -a" reports as I already asked ?

will do tomorrow while in front of the setup. When I provided you the
info last time http://marc.info/?l=linux-netdev&m=138610891121531&w=2
it included the ifconfig -a output and no drops were seen there

> You could temporary change the dev_kfree_skb_any() in mlx4_en_xmit()
> to call kfree_skb(skb) instead, to get a stack trace (perf record -a -g
> -e skb:kfree_skb sleep 20 ; perf report)

yes, tomorrow

> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> index f54ebd5a1702..53130f27dec0 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> @@ -873,7 +873,7 @@ tx_drop_unmap:
>  	}
>
>  tx_drop:
> -	dev_kfree_skb_any(skb);
> +	kfree_skb(skb);
>  	priv->stats.tx_dropped++;
>  	return NETDEV_TX_OK;
>  }

^ permalink raw reply	[flat|nested] 63+ messages in thread
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-06 13:36 ` Eric Dumazet 2013-12-07 21:20 ` Or Gerlitz @ 2013-12-08 12:09 ` Or Gerlitz 1 sibling, 0 replies; 63+ messages in thread From: Or Gerlitz @ 2013-12-08 12:09 UTC (permalink / raw) To: Eric Dumazet, Or Gerlitz Cc: Alexei Starovoitov, David Miller, Joseph Gasparakis, Jerry Chu, Eric Dumazet, Pravin B Shelar, netdev@vger.kernel.org On 06/12/2013 15:36, Eric Dumazet wrote: > On Fri, 2013-12-06 at 11:06 +0200, Or Gerlitz wrote: >> On Wed, Dec 4, 2013 at 3:23 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote: >>> skb->dev is in the first cache line, where we access skb->next anyway. >>> I could use skb->cb[] like the following patch : >> Hi Eric, I applied on the net tree the patch you posted yesterday >> "net: introduce dev_consume_skb_any()" along with the network drivers >> part of this patch, unless I got it wrong, I assume both pieces are >> needed? >> >> So I re-run the vxlan/veth test that we suspect goes through packet drops on TX. >> >> With the patches applied I have almost no samples of that event >> >> $ ./perf report -i perf.data > How did you get this perf.data file ? There are a few drops. > >> Samples: 89 of event 'skb:kfree_skb', Event count (approx.): 89 >> + 39.33% ksoftirqd/2 [kernel.kallsyms] [k] net_tx_action >> + 28.09% swapper [kernel.kallsyms] [k] net_tx_action >> + 28.09% sshd [kernel.kallsyms] [k] net_tx_action >> + 2.25% swapper [kernel.kallsyms] [k] kfree_skb >> + 1.12% kworker/2:2 [kernel.kallsyms] [k] net_tx_action >> + 1.12% iperf [kernel.kallsyms] [k] net_tx_action >> >> ./perf report -i perf.data --sort dso,symbol >> Samples: 89 of event 'skb:kfree_skb', Event count (approx.): 89 >> + 97.75% [kernel.kallsyms] [k] net_tx_action >> + 2.25% [kernel.kallsyms] [k] kfree_skb >> -- > OK, this means your driver drops few packets in its ndo_start_xmit() handler. 
I wasn't sure I followed how the above led you to conclude that the
drops occur at the driver -- but, anyway, I applied your other patches
and the one below, which makes the mlx4 driver call kfree_skb(skb), and
I don't see mlx4 hits on either the client or server side

client side:

Samples: 133 of event 'skb:kfree_skb', Event count (approx.): 133
+ 40.60% ksoftirqd/2 [kernel.kallsyms] [k] net_tx_action
+ 25.56% iperf [kernel.kallsyms] [k] net_tx_action
+ 24.06% swapper [kernel.kallsyms] [k] net_tx_action
+ 3.01% iperf [kernel.kallsyms] [k] kfree_skb
+ 2.26% swapper [kernel.kallsyms] [k] kfree_skb
+ 2.26% kworker/2:1 [kernel.kallsyms] [k] net_tx_action
+ 0.75% rcuc/2 [kernel.kallsyms] [k] net_tx_action
+ 0.75% kworker/2:2 [kernel.kallsyms] [k] net_tx_action
+ 0.75% ypbind [kernel.kallsyms] [k] net_tx_action

server side:

Samples: 57 of event 'skb:kfree_skb', Event count (approx.): 57
+ 47.37% swapper [kernel.kallsyms] [k] kfree_skb
+ 22.81% iperf [kernel.kallsyms] [k] kfree_skb
+ 8.77% ksoftirqd/2 [kernel.kallsyms] [k] kfree_skb
+ 7.02% hald-addon-acpi [kernel.kallsyms] [k] kfree_skb
+ 7.02% ls [kernel.kallsyms] [k] kfree_skb
+ 3.51% umount.nfs [kernel.kallsyms] [k] kfree_skb
+ 1.75% udevd [kernel.kallsyms] [k] kfree_skb
+ 1.75% rpc.idmapd [kernel.kallsyms] [k] kfree_skb

I will provide you the full perf report files out-of-band, but I have
scrolled through the hits and didn't see one in mlx4... any idea
where/how to take this from here?

>
> Could you give us "ifconfig -a" reports as I already asked ?
Sure -- I see that on both sides there are some drops on a 1Gb/s NIC
which is not part of the test

client side (mlx4 NIC is eth6):

r-dcs44-005 perf]# ifconfig -a
br1       Link encap:Ethernet  HWaddr 0E:08:CC:54:78:44
          inet addr:192.168.52.144  Bcast:192.168.52.255  Mask:255.255.255.0
          inet6 addr: fe80::c0c0:45ff:feff:bfed/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
          RX packets:11 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:732 (732.0 b)  TX bytes:648 (648.0 b)

eth0      Link encap:Ethernet  HWaddr 00:50:56:25:4B:05
          inet addr:10.212.75.5  Bcast:10.212.255.255  Mask:255.255.0.0
          inet6 addr: fe80::250:56ff:fe25:4b05/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:70485 errors:0 dropped:467 overruns:0 frame:0
          TX packets:35821 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:58838076 (56.1 MiB)  TX bytes:20632115 (19.6 MiB)

eth6      Link encap:Ethernet  HWaddr 00:02:C9:E9:BB:B2
          inet addr:192.168.30.144  Bcast:192.168.30.255  Mask:255.255.255.0
          inet6 addr: fe80::2:c900:1e9:bbb2/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:58503788 errors:0 dropped:0 overruns:0 frame:0
          TX packets:389819285 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:6604815254 (6.1 GiB)  TX bytes:589822872168 (549.3 GiB)

eth7      Link encap:Ethernet  HWaddr 52:54:00:86:B6:48
          BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:137 errors:0 dropped:0 overruns:0 frame:0
          TX packets:137 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:13860 (13.5 KiB)  TX bytes:13860 (13.5 KiB)

veth0     Link encap:Ethernet  HWaddr E6:95:68:49:A6:3D
          inet addr:192.168.62.144  Bcast:192.168.62.255  Mask:255.255.255.0
          inet6 addr: fe80::e495:68ff:fe49:a63d/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
          RX packets:58510472 errors:0 dropped:0 overruns:0 frame:0
          TX packets:55440581 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:3680050172 (3.4 GiB)  TX bytes:552349060836 (514.4 GiB)

veth1     Link encap:Ethernet  HWaddr 5A:4D:A3:4B:B1:97
          inet6 addr: fe80::584d:a3ff:fe4b:b197/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
          RX packets:55440581 errors:0 dropped:0 overruns:0 frame:0
          TX packets:58510475 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:552349060836 (514.4 GiB)  TX bytes:3680050334 (3.4 GiB)

vxlan42   Link encap:Ethernet  HWaddr 0E:08:CC:54:78:44
          inet addr:192.168.42.144  Bcast:192.168.42.255  Mask:255.255.255.0
          inet6 addr: fe80::c08:ccff:fe54:7844/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
          RX packets:58510461 errors:0 dropped:0 overruns:0 frame:0
          TX packets:55440599 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:2860902728 (2.6 GiB)  TX bytes:553236159764 (515.2 GiB)

server side (mlx4 NIC is eth2):

r-dcs47-005 perf]# ifconfig -a
br1       Link encap:Ethernet  HWaddr 2A:9B:C5:5F:FA:AB
          inet addr:192.168.52.147  Bcast:192.168.52.255  Mask:255.255.255.0
          inet6 addr: fe80::cca:f9ff:fead:4210/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
          RX packets:33 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:2172 (2.1 KiB)  TX bytes:648 (648.0 b)

eth0      Link encap:Ethernet  HWaddr 00:50:56:25:4A:05
          inet addr:10.212.74.5  Bcast:10.212.255.255  Mask:255.255.0.0
          inet6 addr: fe80::250:56ff:fe25:4a05/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:15274 errors:0 dropped:18 overruns:0 frame:0
          TX packets:9190 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:36265311 (34.5 MiB)  TX bytes:5736966 (5.4 MiB)

eth2      Link encap:Ethernet  HWaddr 00:02:C9:E9:C0:82
          inet addr:192.168.30.147  Bcast:192.168.30.255  Mask:255.255.255.0
          inet6 addr: fe80::2:c900:1e9:c082/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:442423939 errors:0 dropped:0 overruns:0 frame:0
          TX packets:66524628 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:669425541728 (623.4 GiB)  TX bytes:7511309282 (6.9 GiB)

eth3      Link encap:Ethernet  HWaddr 52:54:00:5D:70:D9
          BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:32 errors:0 dropped:0 overruns:0 frame:0
          TX packets:32 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:2196 (2.1 KiB)  TX bytes:2196 (2.1 KiB)

veth0     Link encap:Ethernet  HWaddr 56:3D:34:30:86:68
          inet addr:192.168.62.147  Bcast:192.168.62.255  Mask:255.255.255.0
          inet6 addr: fe80::543d:34ff:fe30:8668/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
          RX packets:442491685 errors:0 dropped:0 overruns:0 frame:0
          TX packets:66538057 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:647403574534 (602.9 GiB)  TX bytes:4185942998 (3.8 GiB)

veth1     Link encap:Ethernet  HWaddr 2A:9B:C5:5F:FA:AB
          inet6 addr: fe80::289b:c5ff:fe5f:faab/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
          RX packets:66538066 errors:0 dropped:0 overruns:0 frame:0
          TX packets:442491738 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:4185943484 (3.8 GiB)  TX bytes:647403652126 (602.9 GiB)

vxlan42   Link encap:Ethernet  HWaddr F6:E2:99:BD:D6:58
          inet addr:192.168.42.147  Bcast:192.168.42.255  Mask:255.255.255.0
          inet6 addr: fe80::f4e2:99ff:febd:d658/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
          RX packets:442491767 errors:0 dropped:0 overruns:0 frame:0
          TX packets:66538082 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:641208829864 (597.1 GiB)  TX bytes:5250554092 (4.8 GiB)

> You could temporary change the dev_kfree_skb_any() in mlx4_en_xmit()
> to call kfree_skb(skb) instead, to get a stack trace (perf record -a -g
> -e skb:kfree_skb sleep 20 ; perf report)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> index f54ebd5a1702..53130f27dec0 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> @@ -873,7 +873,7 @@ tx_drop_unmap:
>  }
>
>  tx_drop:
> -	dev_kfree_skb_any(skb);
> +	kfree_skb(skb);
>  	priv->stats.tx_dropped++;
>  	return NETDEV_TX_OK;
>  }

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-03 23:59 ` Eric Dumazet 2013-12-04 0:26 ` Alexei Starovoitov @ 2013-12-04 6:39 ` David Miller 2013-12-04 17:40 ` Eric Dumazet 2013-12-05 12:45 ` [PATCH net-next] net: introduce dev_consume_skb_any() Eric Dumazet 2 siblings, 1 reply; 63+ messages in thread From: David Miller @ 2013-12-04 6:39 UTC (permalink / raw) To: eric.dumazet Cc: or.gerlitz, joseph.gasparakis, hkchu, ogerlitz, edumazet, ast, pshelar, netdev From: Eric Dumazet <eric.dumazet@gmail.com> Date: Tue, 03 Dec 2013 15:59:59 -0800 > On Wed, 2013-12-04 at 01:10 +0200, Or Gerlitz wrote: > >> Samples: 883K of event 'skb:kfree_skb', Event count (approx.): 883406 >> + 97.13% swapper [kernel.kallsyms] [k] net_tx_action >> + 1.53% iperf [kernel.kallsyms] [k] net_tx_action >> + 1.03% perf [kernel.kallsyms] [k] net_tx_action >> + 0.27% ksoftirqd/7 [kernel.kallsyms] [k] net_tx_action >> + 0.03% kworker/7:1 [kernel.kallsyms] [k] net_tx_action >> + 0.00% rpcbind [kernel.kallsyms] [k] net_tx_action >> + 0.00% swapper [kernel.kallsyms] [k] kfree_skb >> + 0.00% sleep [kernel.kallsyms] [k] net_tx_action >> + 0.00% hald-addon-acpi [kernel.kallsyms] [k] kfree_skb >> + 0.00% iperf [kernel.kallsyms] [k] kfree_skb >> + 0.00% perf [kernel.kallsyms] [k] kfree_skb >> > > Right, I actually have a patch for that, but was waiting for net-next > being re-opened : > > commit 9a731d750dd8bf0b8c20fb1ca53c42317fb4dd37 > Author: Eric Dumazet <edumazet@google.com> > Date: Mon Nov 25 13:09:20 2013 -0800 > > net-fixes: introduce dev_consume_skb_any() I definitely prefer the control block approach to this. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-04 6:39 ` David Miller @ 2013-12-04 17:40 ` Eric Dumazet 0 siblings, 0 replies; 63+ messages in thread From: Eric Dumazet @ 2013-12-04 17:40 UTC (permalink / raw) To: David Miller Cc: or.gerlitz, joseph.gasparakis, hkchu, ogerlitz, edumazet, ast, pshelar, netdev On Wed, 2013-12-04 at 01:39 -0500, David Miller wrote: > I definitely prefer the control block approach to this. I polished the patch to keep this knowledge in net/core/dev.c ^ permalink raw reply [flat|nested] 63+ messages in thread
* [PATCH net-next] net: introduce dev_consume_skb_any() 2013-12-03 23:59 ` Eric Dumazet 2013-12-04 0:26 ` Alexei Starovoitov 2013-12-04 6:39 ` David Miller @ 2013-12-05 12:45 ` Eric Dumazet 2013-12-05 14:13 ` Hannes Frederic Sowa 2013-12-06 20:24 ` David Miller 2 siblings, 2 replies; 63+ messages in thread From: Eric Dumazet @ 2013-12-05 12:45 UTC (permalink / raw) To: David Miller; +Cc: netdev

From: Eric Dumazet <edumazet@google.com>

Some network drivers use dev_kfree_skb_any() and dev_kfree_skb_irq()
helpers to free skbs, both for dropped packets and TX completed ones.

We need to separate the two causes to get better diagnostics
given by dropwatch or "perf record -e skb:kfree_skb"

This patch provides two new helpers, dev_consume_skb_any() and
dev_consume_skb_irq() to be used for consumed skbs.

__dev_kfree_skb_irq() is slightly optimized to remove one
atomic_dec_and_test() in fast path, and use this_cpu_{r|w} accessors.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/netdevice.h | 53 +++++++++++++++++++++++++++++-------
 net/core/dev.c            | 45 ++++++++++++++++++++----------
 2 files changed, 74 insertions(+), 24 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 7f0ed423a360..c6d64d20050c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2368,17 +2368,52 @@ static inline int netif_copy_real_num_queues(struct net_device *to_dev,
 #define DEFAULT_MAX_NUM_RSS_QUEUES (8)
 int netif_get_num_default_rss_queues(void);

-/* Use this variant when it is known for sure that it
- * is executing from hardware interrupt context or with hardware interrupts
- * disabled.
- */
-void dev_kfree_skb_irq(struct sk_buff *skb);
+enum skb_free_reason {
+	SKB_REASON_CONSUMED,
+	SKB_REASON_DROPPED,
+};
+
+void __dev_kfree_skb_irq(struct sk_buff *skb, enum skb_free_reason reason);
+void __dev_kfree_skb_any(struct sk_buff *skb, enum skb_free_reason reason);

-/* Use this variant in places where it could be invoked
- * from either hardware interrupt or other context, with hardware interrupts
- * either disabled or enabled.
+/*
+ * It is not allowed to call kfree_skb() or consume_skb() from hardware
+ * interrupt context or with hardware interrupts being disabled.
+ * (in_irq() || irqs_disabled())
+ *
+ * We provide four helpers that can be used in following contexts :
+ *
+ * dev_kfree_skb_irq(skb) when caller drops a packet from irq context,
+ *  replacing kfree_skb(skb)
+ *
+ * dev_consume_skb_irq(skb) when caller consumes a packet from irq context.
+ *  Typically used in place of consume_skb(skb) in TX completion path
+ *
+ * dev_kfree_skb_any(skb) when caller doesn't know its current irq context,
+ *  replacing kfree_skb(skb)
+ *
+ * dev_consume_skb_any(skb) when caller doesn't know its current irq context,
+ *  and consumed a packet. Used in place of consume_skb(skb)
  */
-void dev_kfree_skb_any(struct sk_buff *skb);
+static inline void dev_kfree_skb_irq(struct sk_buff *skb)
+{
+	__dev_kfree_skb_irq(skb, SKB_REASON_DROPPED);
+}
+
+static inline void dev_consume_skb_irq(struct sk_buff *skb)
+{
+	__dev_kfree_skb_irq(skb, SKB_REASON_CONSUMED);
+}
+
+static inline void dev_kfree_skb_any(struct sk_buff *skb)
+{
+	__dev_kfree_skb_any(skb, SKB_REASON_DROPPED);
+}
+
+static inline void dev_consume_skb_any(struct sk_buff *skb)
+{
+	__dev_kfree_skb_any(skb, SKB_REASON_CONSUMED);
+}

 int netif_rx(struct sk_buff *skb);
 int netif_rx_ni(struct sk_buff *skb);
diff --git a/net/core/dev.c b/net/core/dev.c
index ba3b7ea5ebb3..aa54a742f392 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2145,30 +2145,42 @@ void __netif_schedule(struct Qdisc *q)
 }
 EXPORT_SYMBOL(__netif_schedule);

-void dev_kfree_skb_irq(struct sk_buff *skb)
+struct dev_kfree_skb_cb {
+	enum skb_free_reason reason;
+};
+
+static struct dev_kfree_skb_cb *get_kfree_skb_cb(const struct sk_buff *skb)
+{
+	return (struct dev_kfree_skb_cb *)skb->cb;
+}
+
+void __dev_kfree_skb_irq(struct sk_buff *skb, enum skb_free_reason reason)
 {
-	if (atomic_dec_and_test(&skb->users)) {
-		struct softnet_data *sd;
-		unsigned long flags;
+	unsigned long flags;

-		local_irq_save(flags);
-		sd = &__get_cpu_var(softnet_data);
-		skb->next = sd->completion_queue;
-		sd->completion_queue = skb;
-		raise_softirq_irqoff(NET_TX_SOFTIRQ);
-		local_irq_restore(flags);
+	if (likely(atomic_read(&skb->users) == 1)) {
+		smp_rmb();
+		atomic_set(&skb->users, 0);
+	} else if (likely(!atomic_dec_and_test(&skb->users))) {
+		return;
 	}
+	get_kfree_skb_cb(skb)->reason = reason;
+	local_irq_save(flags);
+	skb->next = __this_cpu_read(softnet_data.completion_queue);
+	__this_cpu_write(softnet_data.completion_queue, skb);
+	raise_softirq_irqoff(NET_TX_SOFTIRQ);
+	local_irq_restore(flags);
 }
-EXPORT_SYMBOL(dev_kfree_skb_irq);
+EXPORT_SYMBOL(__dev_kfree_skb_irq);

-void dev_kfree_skb_any(struct sk_buff *skb)
+void __dev_kfree_skb_any(struct sk_buff *skb, enum skb_free_reason reason)
 {
 	if (in_irq() || irqs_disabled())
-		dev_kfree_skb_irq(skb);
+		__dev_kfree_skb_irq(skb, reason);
 	else
 		dev_kfree_skb(skb);
 }
-EXPORT_SYMBOL(dev_kfree_skb_any);
+EXPORT_SYMBOL(__dev_kfree_skb_any);

 /**
@@ -3306,7 +3318,10 @@ static void net_tx_action(struct softirq_action *h)
 			clist = clist->next;

 			WARN_ON(atomic_read(&skb->users));
-			trace_kfree_skb(skb, net_tx_action);
+			if (likely(get_kfree_skb_cb(skb)->reason == SKB_REASON_CONSUMED))
+				trace_consume_skb(skb);
+			else
+				trace_kfree_skb(skb, net_tx_action);
 			__kfree_skb(skb);
 		}
 	}

^ permalink raw reply related [flat|nested] 63+ messages in thread
* Re: [PATCH net-next] net: introduce dev_consume_skb_any() 2013-12-05 12:45 ` [PATCH net-next] net: introduce dev_consume_skb_any() Eric Dumazet @ 2013-12-05 14:13 ` Hannes Frederic Sowa 2013-12-05 14:45 ` Eric Dumazet 2013-12-06 20:24 ` David Miller 1 sibling, 1 reply; 63+ messages in thread From: Hannes Frederic Sowa @ 2013-12-05 14:13 UTC (permalink / raw) To: Eric Dumazet; +Cc: David Miller, netdev On Thu, Dec 05, 2013 at 04:45:08AM -0800, Eric Dumazet wrote: > - local_irq_save(flags); > - sd = &__get_cpu_var(softnet_data); > - skb->next = sd->completion_queue; > - sd->completion_queue = skb; > - raise_softirq_irqoff(NET_TX_SOFTIRQ); > - local_irq_restore(flags); > + if (likely(atomic_read(&skb->users) == 1)) { > + smp_rmb(); Could you give me a hint why this barrier is needed? IMHO the volatile access in atomic_read should get rid of the control dependency so I don't see a need for this barrier. Without the volatile access a compiler-barrier would still suffice, I guess? > + atomic_set(&skb->users, 0); > + } else if (likely(!atomic_dec_and_test(&skb->users))) { > + return; Or does this memory barrier deal with the part below this return? Thanks, Hannes ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH net-next] net: introduce dev_consume_skb_any() 2013-12-05 14:13 ` Hannes Frederic Sowa @ 2013-12-05 14:45 ` Eric Dumazet 2013-12-05 15:05 ` Eric Dumazet 0 siblings, 1 reply; 63+ messages in thread From: Eric Dumazet @ 2013-12-05 14:45 UTC (permalink / raw) To: Hannes Frederic Sowa; +Cc: David Miller, netdev On Thu, 2013-12-05 at 15:13 +0100, Hannes Frederic Sowa wrote: > On Thu, Dec 05, 2013 at 04:45:08AM -0800, Eric Dumazet wrote: > > - local_irq_save(flags); > > - sd = &__get_cpu_var(softnet_data); > > - skb->next = sd->completion_queue; > > - sd->completion_queue = skb; > > - raise_softirq_irqoff(NET_TX_SOFTIRQ); > > - local_irq_restore(flags); > > + if (likely(atomic_read(&skb->users) == 1)) { > > + smp_rmb(); > > Could you give me a hint why this barrier is needed? IMHO the volatile > access in atomic_read should get rid of the control dependency so I > don't see a need for this barrier. Without the volatile access a > compiler-barrier would still suffice, I guess? Please take a look at kfree_skb() implementation. If you think a comment is needed there, please feel free to add it. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH net-next] net: introduce dev_consume_skb_any() 2013-12-05 14:45 ` Eric Dumazet @ 2013-12-05 15:05 ` Eric Dumazet 2013-12-05 15:44 ` Hannes Frederic Sowa 0 siblings, 1 reply; 63+ messages in thread From: Eric Dumazet @ 2013-12-05 15:05 UTC (permalink / raw) To: Hannes Frederic Sowa; +Cc: David Miller, netdev On Thu, 2013-12-05 at 06:45 -0800, Eric Dumazet wrote: > On Thu, 2013-12-05 at 15:13 +0100, Hannes Frederic Sowa wrote: > > On Thu, Dec 05, 2013 at 04:45:08AM -0800, Eric Dumazet wrote: > > > - local_irq_save(flags); > > > - sd = &__get_cpu_var(softnet_data); > > > - skb->next = sd->completion_queue; > > > - sd->completion_queue = skb; > > > - raise_softirq_irqoff(NET_TX_SOFTIRQ); > > > - local_irq_restore(flags); > > > + if (likely(atomic_read(&skb->users) == 1)) { > > > + smp_rmb(); > > > > Could you give me a hint why this barrier is needed? IMHO the volatile > > access in atomic_read should get rid of the control dependency so I > > don't see a need for this barrier. Without the volatile access a > > compiler-barrier would still suffice, I guess? > > Please take a look at kfree_skb() implementation. > > If you think a comment is needed there, please feel free to add it. > My understanding of this (old) barrier here is an implicit wmb in skb_get() This probably needs something like : static inline struct sk_buff *skb_get(struct sk_buff *skb) { smp_mb__before_atomic_inc(); /* check {consume|kfree}_skb() */ atomic_inc(&skb->users); } ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH net-next] net: introduce dev_consume_skb_any() 2013-12-05 15:05 ` Eric Dumazet @ 2013-12-05 15:44 ` Hannes Frederic Sowa 2013-12-05 16:38 ` Eric Dumazet 0 siblings, 1 reply; 63+ messages in thread From: Hannes Frederic Sowa @ 2013-12-05 15:44 UTC (permalink / raw) To: Eric Dumazet; +Cc: David Miller, netdev On Thu, Dec 05, 2013 at 07:05:52AM -0800, Eric Dumazet wrote: > On Thu, 2013-12-05 at 06:45 -0800, Eric Dumazet wrote: > > On Thu, 2013-12-05 at 15:13 +0100, Hannes Frederic Sowa wrote: > > > On Thu, Dec 05, 2013 at 04:45:08AM -0800, Eric Dumazet wrote: > > > > - local_irq_save(flags); > > > > - sd = &__get_cpu_var(softnet_data); > > > > - skb->next = sd->completion_queue; > > > > - sd->completion_queue = skb; > > > > - raise_softirq_irqoff(NET_TX_SOFTIRQ); > > > > - local_irq_restore(flags); > > > > + if (likely(atomic_read(&skb->users) == 1)) { > > > > + smp_rmb(); > > > > > > Could you give me a hint why this barrier is needed? IMHO the volatile > > > access in atomic_read should get rid of the control dependency so I > > > don't see a need for this barrier. Without the volatile access a > > > compiler-barrier would still suffice, I guess? > > > > Please take a look at kfree_skb() implementation. > > > > If you think a comment is needed there, please feel free to add it. > > > > My understanding of this (old) barrier here is an implicit wmb in > skb_get() > > This probably needs something like : > > static inline struct sk_buff *skb_get(struct sk_buff *skb) > { > smp_mb__before_atomic_inc(); /* check {consume|kfree}_skb() */ > atomic_inc(&skb->users); > } Thanks for the pointer to kfree_skb. I found this commit which added the barrier in kfree_skb (from history.git): commit 09d3e84de438f217510b604a980befd07b0c8262 Author: Herbert Xu <herbert@gondor.apana.org.au> Date: Sat Feb 5 03:23:27 2005 -0800 [NET]: Add missing memory barrier to kfree_skb(). Also kill kfree_skb_fast(), that is a relic from fast switching which was killed off years ago. 
The bug is that in the case where we do the atomic_read() optimization, we need to make sure that reads of skb state later in __kfree_skb() processing (particularly the skb->list BUG check) are not reordered to occur before the counter read by the cpu. Thanks to Olaf Kirch and Anton Blanchard for discovering and helping fix this bug. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net> It makes some sense but I did not grasp the whole ->users dependency picture, yet. I guess the barrier is only needed when refcount drops down to 0 and we don't necessarily need one when incrementing ->users. Thank you, Hannes ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH net-next] net: introduce dev_consume_skb_any() 2013-12-05 15:44 ` Hannes Frederic Sowa @ 2013-12-05 16:38 ` Eric Dumazet 2013-12-05 16:54 ` Hannes Frederic Sowa 0 siblings, 1 reply; 63+ messages in thread From: Eric Dumazet @ 2013-12-05 16:38 UTC (permalink / raw) To: Hannes Frederic Sowa; +Cc: David Miller, netdev On Thu, 2013-12-05 at 16:44 +0100, Hannes Frederic Sowa wrote: > It makes some sense but I did not grasp the whole ->users dependency > picture, yet. I guess the barrier is only needed when refcount drops > down to 0 and we don't necessarily need one when incrementing ->users. If you are the only user of this skb, really no smp barrier is needed at all. The problem comes when another cpu is working on the skb, and finally releases its reference on it. Before releasing its reference, it must commit all changes it might have done onto skb. Otherwise another cpu might read stale data. The smp_wmb() is done by the atomic_dec_and_test(), as it contains a full barrier. So the smp_rmb() pairs with the barrier done in atomic_dec_and_test() ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH net-next] net: introduce dev_consume_skb_any() 2013-12-05 16:38 ` Eric Dumazet @ 2013-12-05 16:54 ` Hannes Frederic Sowa 0 siblings, 0 replies; 63+ messages in thread From: Hannes Frederic Sowa @ 2013-12-05 16:54 UTC (permalink / raw) To: Eric Dumazet; +Cc: David Miller, netdev On Thu, Dec 05, 2013 at 08:38:05AM -0800, Eric Dumazet wrote: > On Thu, 2013-12-05 at 16:44 +0100, Hannes Frederic Sowa wrote: > > > It makes some sense but I did not grasp the whole ->users dependency > > picture, yet. I guess the barrier is only needed when refcount drops > > down to 0 and we don't necessarily need one when incrementing ->users. > > If you are the only user of this skb, really no smp barrier is needed at > all. > > The problem comes when another cpu is working on the skb, and finally > releases its reference on it. > > Before releasing its reference, it must commit all changes it might have > done onto skb. Otherwise another cpu might read stale data. > > The smp_wmb() is done by the atomic_dec_and_test(), as it contains a > full barrier. > > So the smp_rmb() pairs with the barrier done in atomic_dec_and_test() Ha, it all makes sense now! Thanks, Eric! (Sorry for the noise but I find this kind of problems very interesting) ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH net-next] net: introduce dev_consume_skb_any() 2013-12-05 12:45 ` [PATCH net-next] net: introduce dev_consume_skb_any() Eric Dumazet 2013-12-05 14:13 ` Hannes Frederic Sowa @ 2013-12-06 20:24 ` David Miller 1 sibling, 0 replies; 63+ messages in thread From: David Miller @ 2013-12-06 20:24 UTC (permalink / raw) To: eric.dumazet; +Cc: netdev From: Eric Dumazet <eric.dumazet@gmail.com> Date: Thu, 05 Dec 2013 04:45:08 -0800 > From: Eric Dumazet <edumazet@google.com> > > Some network drivers use dev_kfree_skb_any() and dev_kfree_skb_irq() > helpers to free skbs, both for dropped packets and TX completed ones. > > We need to separate the two causes to get better diagnostics > given by dropwatch or "perf record -e skb:kfree_skb" > > This patch provides two new helpers, dev_consume_skb_any() and > dev_consume_skb_irq() to be used for consumed skbs. > > __dev_kfree_skb_irq() is slightly optimized to remove one > atomic_dec_and_test() in fast path, and use this_cpu_{r|w} accessors. > > Signed-off-by: Eric Dumazet <edumazet@google.com> Applied, thanks Eric. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-03 21:09 ` Or Gerlitz 2013-12-03 21:24 ` Eric Dumazet @ 2013-12-03 23:13 ` Joseph Gasparakis 2013-12-03 23:09 ` Or Gerlitz 1 sibling, 1 reply; 63+ messages in thread From: Joseph Gasparakis @ 2013-12-03 23:13 UTC (permalink / raw) To: Or Gerlitz Cc: Gasparakis, Joseph, Eric Dumazet, Jerry Chu, Or Gerlitz, Eric Dumazet, Alexei Starovoitov, Pravin B Shelar, David Miller, netdev On Tue, 3 Dec 2013, Or Gerlitz wrote: > On Tue, Dec 3, 2013 at 11:11 PM, Joseph Gasparakis > <joseph.gasparakis@intel.com> wrote: > > >>> lack of GRO : receiver seems to not be able to receive as fast as you want. > >>>> TCPOFOQueue: 3167879 > >>> So many packets are received out of order (because of losses) > > >> I see that there's no GRO also for the non-veth tests which involve > >> vxlan, and over there the receiving side is capable to consume the > >> packets, do you have rough explaination why adding veth to the chain > >> is such game changer which makes things to start falling out? > > > I have seen this before. Here are my findings: > > > > The gso_type is different if the skb comes from veth or not. From veth, > > you will see the SKB_GSO_DODGY set. This breaks things as when the > > skb with DODGY set moves from vxlan to the driver through dev_xmit_hard, > > the stack drops it silently. I never got the time to find the root cause > > for this, but I know it causes re-transmissions and big performance > > degregation. > > > > I went as far as just quickly hacking a one liner unsetting the DODGY bit > > in vxlan.c and that bypassed the issue and recovered the performance > > problem, but obviously this is not a real fix. > > thanks for the heads up, few quick questions/clafications -- > > -- you are talking on drops done @ the sender side, correct? Eric was > saying we have evidences that the drops happen on the receiver. I am *guessing* drops on the Rx are due to the drops at the Tx. 
See my answer to your next question for more info.

> -- without the hack you did, still packets are sent/received, so what
> makes the stack to drop only some of them?

What I had seen is GSOs getting dropped on the Tx side. Basically the
GSOs never made it to the driver; they were broken into smaller non-GSO
skbs by the stack. I think the stack is not handling GSOs with the
DODGY bit set well, and that causes maybe only part of the packet to be
emitted, causing the re-transmits (and maybe the drops on your Rx
end)? Of course all this is speculation; the fact that I know is that
as soon as I was forcing the gso type I saw offloaded VXLAN
encapsulated traffic at decent speeds.

> -- why packets coming from veth would have the SKB_GSO_DODGY bit set?

That is something I would love to know too. I am guessing this is a way
for the VM to say it is a non-trusted packet? And maybe all this can be
fixed by setting something on the VM through a userspace tool that will
stop veth from setting the DODGY bit?

> -- so where is now (say net.git or 3.12.x) this one line you commented
> out? I don't see in vxlan.c or in ip_tunnel_core.c / ip_tunnel.c
> explicit setting of SKB_GSO_DODGY

I did not commit it, as this was just a workaround to prove to myself
that the problem I was seeing was due to the gso_type, and it would
actually just hide the problem and not give a proper solution to it.

> Also, I am pretty sure the problem exists also when sending/receiving
> guest traffic through tap/macvtap <--> vhost/virtio-net and friends, I
> just sticked to the veth flavour b/c its one (== the hypervisor)
> network stack to debug and not two (+ the guest one).
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-03 23:13 ` vxlan/veth performance issues on net.git + latest kernels Joseph Gasparakis @ 2013-12-03 23:09 ` Or Gerlitz 2013-12-04 0:35 ` Joseph Gasparakis 0 siblings, 1 reply; 63+ messages in thread From: Or Gerlitz @ 2013-12-03 23:09 UTC (permalink / raw) To: Joseph Gasparakis Cc: Eric Dumazet, Jerry Chu, Or Gerlitz, Eric Dumazet, Alexei Starovoitov, Pravin B Shelar, David Miller, netdev On Wed, Dec 4, 2013 at 1:13 AM, Joseph Gasparakis <joseph.gasparakis@intel.com> wrote: > > > On Tue, 3 Dec 2013, Or Gerlitz wrote: > >> On Tue, Dec 3, 2013 at 11:11 PM, Joseph Gasparakis >> <joseph.gasparakis@intel.com> wrote: >> >> >>> lack of GRO : receiver seems to not be able to receive as fast as you want. >> >>>> TCPOFOQueue: 3167879 >> >>> So many packets are received out of order (because of losses) >> >> >> I see that there's no GRO also for the non-veth tests which involve >> >> vxlan, and over there the receiving side is capable to consume the >> >> packets, do you have rough explaination why adding veth to the chain >> >> is such game changer which makes things to start falling out? >> >> > I have seen this before. Here are my findings: >> > >> > The gso_type is different if the skb comes from veth or not. From veth, >> > you will see the SKB_GSO_DODGY set. This breaks things as when the >> > skb with DODGY set moves from vxlan to the driver through dev_xmit_hard, >> > the stack drops it silently. I never got the time to find the root cause >> > for this, but I know it causes re-transmissions and big performance >> > degregation. >> > >> > I went as far as just quickly hacking a one liner unsetting the DODGY bit >> > in vxlan.c and that bypassed the issue and recovered the performance >> > problem, but obviously this is not a real fix. >> >> thanks for the heads up, few quick questions/clafications -- >> >> -- you are talking on drops done @ the sender side, correct? 
Eric was >> saying we have evidences that the drops happen on the receiver. > > I am *guessing* drops on the Rx are due to the drops at the Tx. See my > answer to your next question for more info. > >> >> -- without the hack you did, still packets are sent/received, so what >> makes the stack to drop only some of them? >> > > What I had seen is GSOs getting dropped on the Tx side. Basically the GSOs > never made it to the driver, they were broken into non GSO smaller skbs by > the stack. I think the stack is not handling well the GSO with the DODGY > bit set, and that causes it to maybe partially the packet to be emitted, > causing the re-transmits (and maybe the drops on your Rx end)? Of course > all this is speculation, the fact that I know is that as soon as I was > forcing the gso type I saw offloaded VXLAN encapsulated traffic at decent speeds. > >> -- why packets coming from veth would have the SKB_GSO_DODGY bit set? > > That is something I would love to know too. I am guessing this is a way > for the VM to say it is a non-trusted packet? And maybe all this can be > fixed by maybe setting something on the VM through a userspace tool that > will stop the veth to set the DODGY bit? > >> >> -- so where is now (say net.git or 3.12.x) this one line you commented >> out? I don't see in vxlan.c or in ip_tunnel_core.c / ip_tunnel.c >> explicit setting of SKB_GSO_DODGY > > I did not commit it, as this was just a workaround to prove to myself that > the problem I was seing was due to the gso_type, and it would actually > just hide the problem and not give a proper solution to it. > >> >> Also, I am pretty sure the problem exists also when sending/receiving >> guest traffic through tap/macvtap <--> vhost/virtio-net and friends, I >> just sticked to the veth flavour b/c its one (== the hypervisor) >> network stack to debug and not two (+ the guest one). 
understood, can you point me to the line/area you hacked? I'd like to try it too and see the impact >> -- >> To unsubscribe from this list: send the line "unsubscribe netdev" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> ^ permalink raw reply	[flat|nested] 63+ messages in thread
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-03 23:09 ` Or Gerlitz @ 2013-12-04 0:35 ` Joseph Gasparakis 2013-12-04 0:34 ` Alexei Starovoitov ` (2 more replies) 0 siblings, 3 replies; 63+ messages in thread From: Joseph Gasparakis @ 2013-12-04 0:35 UTC (permalink / raw) To: Or Gerlitz Cc: Joseph Gasparakis, Eric Dumazet, Jerry Chu, Or Gerlitz, Eric Dumazet, Alexei Starovoitov, Pravin B Shelar, David Miller, netdev On Tue, 3 Dec 2013, Or Gerlitz wrote: > On Wed, Dec 4, 2013 at 1:13 AM, Joseph Gasparakis > <joseph.gasparakis@intel.com> wrote: > > > > > > On Tue, 3 Dec 2013, Or Gerlitz wrote: > > > >> On Tue, Dec 3, 2013 at 11:11 PM, Joseph Gasparakis > >> <joseph.gasparakis@intel.com> wrote: > >> > >> >>> lack of GRO : receiver seems to not be able to receive as fast as you want. > >> >>>> TCPOFOQueue: 3167879 > >> >>> So many packets are received out of order (because of losses) > >> > >> >> I see that there's no GRO also for the non-veth tests which involve > >> >> vxlan, and over there the receiving side is capable to consume the > >> >> packets, do you have rough explaination why adding veth to the chain > >> >> is such game changer which makes things to start falling out? > >> > >> > I have seen this before. Here are my findings: > >> > > >> > The gso_type is different if the skb comes from veth or not. From veth, > >> > you will see the SKB_GSO_DODGY set. This breaks things as when the > >> > skb with DODGY set moves from vxlan to the driver through dev_xmit_hard, > >> > the stack drops it silently. I never got the time to find the root cause > >> > for this, but I know it causes re-transmissions and big performance > >> > degregation. > >> > > >> > I went as far as just quickly hacking a one liner unsetting the DODGY bit > >> > in vxlan.c and that bypassed the issue and recovered the performance > >> > problem, but obviously this is not a real fix. 
> >> > >> thanks for the heads up, few quick questions/clafications -- > >> > >> -- you are talking on drops done @ the sender side, correct? Eric was > >> saying we have evidences that the drops happen on the receiver. > > > > I am *guessing* drops on the Rx are due to the drops at the Tx. See my > > answer to your next question for more info. > > > >> > >> -- without the hack you did, still packets are sent/received, so what > >> makes the stack to drop only some of them? > >> > > > > What I had seen is GSOs getting dropped on the Tx side. Basically the GSOs > > never made it to the driver, they were broken into non GSO smaller skbs by > > the stack. I think the stack is not handling well the GSO with the DODGY > > bit set, and that causes it to maybe partially the packet to be emitted, > > causing the re-transmits (and maybe the drops on your Rx end)? Of course > > all this is speculation, the fact that I know is that as soon as I was > > forcing the gso type I saw offloaded VXLAN encapsulated traffic at decent speeds. > > > >> -- why packets coming from veth would have the SKB_GSO_DODGY bit set? > > > > That is something I would love to know too. I am guessing this is a way > > for the VM to say it is a non-trusted packet? And maybe all this can be > > fixed by maybe setting something on the VM through a userspace tool that > > will stop the veth to set the DODGY bit? > > > >> > >> -- so where is now (say net.git or 3.12.x) this one line you commented > >> out? I don't see in vxlan.c or in ip_tunnel_core.c / ip_tunnel.c > >> explicit setting of SKB_GSO_DODGY > > > > I did not commit it, as this was just a workaround to prove to myself that > > the problem I was seing was due to the gso_type, and it would actually > > just hide the problem and not give a proper solution to it. 
> > > >> > >> Also, I am pretty sure the problem exists also when sending/receiving > >> guest traffic through tap/macvtap <--> vhost/virtio-net and friends, I > >> just sticked to the veth flavour b/c its one (== the hypervisor) > >> network stack to debug and not two (+ the guest one). > > understood, can you point the line/area you hacked, I'd like to try it > too and see the impact I was printing the gso_type in vxlan_xmit_skb(), right before iptunnel_xmit() gets called (I was focused on UDPv4 encap only). Then I saw the gso_type was different when a VM was involved and when it was not (although I was transmitting exactly the same packet), and then I replaced my printk with something like skb_shinfo(skb)->gso_type = <the gso type I had for non-VM skb> and it all worked. Then I looked into what was different between the two gso_types and the only difference was that SKB_GSO_DODGY was set when Tx'ing from the VM. I am sure I could have been more delicate with the approach, but hey, it worked for me. I would be curious to see if this is the same issue as mine. It seems like it is. > > >> -- > >> To unsubscribe from this list: send the line "unsubscribe netdev" in > >> the body of a message to majordomo@vger.kernel.org > >> More majordomo info at http://vger.kernel.org/majordomo-info.html > >> > ^ permalink raw reply	[flat|nested] 63+ messages in thread
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-04 0:35 ` Joseph Gasparakis @ 2013-12-04 0:34 ` Alexei Starovoitov 2013-12-04 1:29 ` Joseph Gasparakis 2013-12-04 0:44 ` Joseph Gasparakis 2013-12-04 8:35 ` Or Gerlitz 2 siblings, 1 reply; 63+ messages in thread From: Alexei Starovoitov @ 2013-12-04 0:34 UTC (permalink / raw) To: Joseph Gasparakis Cc: Or Gerlitz, Eric Dumazet, Jerry Chu, Or Gerlitz, Eric Dumazet, Pravin B Shelar, David Miller, netdev On Tue, Dec 3, 2013 at 4:35 PM, Joseph Gasparakis <joseph.gasparakis@intel.com> wrote: > > I was printing the gso_type in vxlan_xmit_skb(), right before > iptunnel_xmit() gets called (I was focus UDPv4 encap only). Then I saw the > gso_type was different when a VM was involved and when it was not > (although I was transmitting exactly the same packet), and then I replaced > my printk with something like skb_shinfo(skb)->gso_type = <the gso type I had > for non-VM skb> and it all worked. > > Then I looked into what was different between the two gso_types and the > only difference was that SKB_GSO_DODGY was set when Tx'ing from the VM. > I am sure I could have been more delicate with the aproach, but hey, it > worked for me. hmm. dodgy should be a normal path from a vm. kvm is supposed to negotiate vnet_hdr for tap/macvtap, and the corresponding driver will remap the virtio_net_gso* flags into skb_gso_* flags plus gso_dodgy. ^ permalink raw reply	[flat|nested] 63+ messages in thread
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-04 0:34 ` Alexei Starovoitov @ 2013-12-04 1:29 ` Joseph Gasparakis 2013-12-04 1:18 ` Eric Dumazet 0 siblings, 1 reply; 63+ messages in thread From: Joseph Gasparakis @ 2013-12-04 1:29 UTC (permalink / raw) To: Alexei Starovoitov Cc: Joseph Gasparakis, Or Gerlitz, Eric Dumazet, Jerry Chu, Or Gerlitz, Eric Dumazet, Pravin B Shelar, David Miller, netdev On Tue, 3 Dec 2013, Alexei Starovoitov wrote: > On Tue, Dec 3, 2013 at 4:35 PM, Joseph Gasparakis > <joseph.gasparakis@intel.com> wrote: > > > > I was printing the gso_type in vxlan_xmit_skb(), right before > > iptunnel_xmit() gets called (I was focus UDPv4 encap only). Then I saw the > > gso_type was different when a VM was involved and when it was not > > (although I was transmitting exactly the same packet), and then I replaced > > my printk with something like skb_shinfo(skb)->gso_type = <the gso type I had > > for non-VM skb> and it all worked. > > > > Then I looked into what was different between the two gso_types and the > > only difference was that SKB_GSO_DODGY was set when Tx'ing from the VM. > > I am sure I could have been more delicate with the aproach, but hey, it > > worked for me. > > hmm. dodgy should be a normal path from vm. > kvm suppose to negotiate vnet_hdr for tap/macvtap and corresponding > driver will be remapping virtio_net_gso* flags into skb_gso_* flags > plus gso_dodgy. > Then would the fix be as simple as just unsetting the bit in vxlan? Because I am guessing there is a bug somewhere in the handling of the combination of GSO_UDP_TUNNEL and GSO_DODGY. That would solve my problem, hopefully Or's too if it is the same. ^ permalink raw reply	[flat|nested] 63+ messages in thread
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-04 1:29 ` Joseph Gasparakis @ 2013-12-04 1:18 ` Eric Dumazet 0 siblings, 0 replies; 63+ messages in thread From: Eric Dumazet @ 2013-12-04 1:18 UTC (permalink / raw) To: Joseph Gasparakis Cc: Alexei Starovoitov, Or Gerlitz, Jerry Chu, Or Gerlitz, Eric Dumazet, Pravin B Shelar, David Miller, netdev On Tue, 2013-12-03 at 17:29 -0800, Joseph Gasparakis wrote: > > Then would the fix be as simple as just unsetting the bit in vxlan? > Because I am guessing that there is a bug somewhere where the combination of > GSO_UDP_TUNNEL and GSO_DODGY. That would solve my problem, hopefully Or's > too if it is the same. Not really ;) We need to handle this bit properly, not pretending it is of no use ;) ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-04 0:35 ` Joseph Gasparakis 2013-12-04 0:34 ` Alexei Starovoitov @ 2013-12-04 0:44 ` Joseph Gasparakis 2013-12-04 8:35 ` Or Gerlitz 2 siblings, 0 replies; 63+ messages in thread From: Joseph Gasparakis @ 2013-12-04 0:44 UTC (permalink / raw) To: Joseph Gasparakis Cc: Or Gerlitz, Eric Dumazet, Jerry Chu, Or Gerlitz, Eric Dumazet, Alexei Starovoitov, Pravin B Shelar, David Miller, netdev On Tue, 3 Dec 2013, Joseph Gasparakis wrote: > > > On Tue, 3 Dec 2013, Or Gerlitz wrote: > > > On Wed, Dec 4, 2013 at 1:13 AM, Joseph Gasparakis > > <joseph.gasparakis@intel.com> wrote: > > > > > > > > > On Tue, 3 Dec 2013, Or Gerlitz wrote: > > > > > >> On Tue, Dec 3, 2013 at 11:11 PM, Joseph Gasparakis > > >> <joseph.gasparakis@intel.com> wrote: > > >> > > >> >>> lack of GRO : receiver seems to not be able to receive as fast as you want. > > >> >>>> TCPOFOQueue: 3167879 > > >> >>> So many packets are received out of order (because of losses) > > >> > > >> >> I see that there's no GRO also for the non-veth tests which involve > > >> >> vxlan, and over there the receiving side is capable to consume the > > >> >> packets, do you have rough explaination why adding veth to the chain > > >> >> is such game changer which makes things to start falling out? > > >> > > >> > I have seen this before. Here are my findings: > > >> > > > >> > The gso_type is different if the skb comes from veth or not. From veth, > > >> > you will see the SKB_GSO_DODGY set. This breaks things as when the > > >> > skb with DODGY set moves from vxlan to the driver through dev_xmit_hard, > > >> > the stack drops it silently. I never got the time to find the root cause > > >> > for this, but I know it causes re-transmissions and big performance > > >> > degregation. 
> > >> > > > >> > I went as far as just quickly hacking a one liner unsetting the DODGY bit > > >> > in vxlan.c and that bypassed the issue and recovered the performance > > >> > problem, but obviously this is not a real fix. > > >> > > >> thanks for the heads up, few quick questions/clafications -- > > >> > > >> -- you are talking on drops done @ the sender side, correct? Eric was > > >> saying we have evidences that the drops happen on the receiver. > > > > > > I am *guessing* drops on the Rx are due to the drops at the Tx. See my > > > answer to your next question for more info. > > > > > >> > > >> -- without the hack you did, still packets are sent/received, so what > > >> makes the stack to drop only some of them? > > >> > > > > > > What I had seen is GSOs getting dropped on the Tx side. Basically the GSOs > > > never made it to the driver, they were broken into non GSO smaller skbs by > > > the stack. I think the stack is not handling well the GSO with the DODGY > > > bit set, and that causes it to maybe partially the packet to be emitted, > > > causing the re-transmits (and maybe the drops on your Rx end)? Of course > > > all this is speculation, the fact that I know is that as soon as I was > > > forcing the gso type I saw offloaded VXLAN encapsulated traffic at decent speeds. > > > > > >> -- why packets coming from veth would have the SKB_GSO_DODGY bit set? > > > > > > That is something I would love to know too. I am guessing this is a way > > > for the VM to say it is a non-trusted packet? And maybe all this can be > > > fixed by maybe setting something on the VM through a userspace tool that > > > will stop the veth to set the DODGY bit? > > > > > >> > > >> -- so where is now (say net.git or 3.12.x) this one line you commented > > >> out? 
I don't see in vxlan.c or in ip_tunnel_core.c / ip_tunnel.c > > >> explicit setting of SKB_GSO_DODGY > > > > > > I did not commit it, as this was just a workaround to prove to myself that > > > the problem I was seing was due to the gso_type, and it would actually > > > just hide the problem and not give a proper solution to it. > > > > > >> > > >> Also, I am pretty sure the problem exists also when sending/receiving > > >> guest traffic through tap/macvtap <--> vhost/virtio-net and friends, I > > >> just sticked to the veth flavour b/c its one (== the hypervisor) > > >> network stack to debug and not two (+ the guest one). > > > > understood, can you point the line/area you hacked, I'd like to try it > > too and see the impact > > I was printing the gso_type in vxlan_xmit_skb(), right before > iptunnel_xmit() gets called (I was focus UDPv4 encap only). Then I saw the > gso_type was different when a VM was involved and when it was not > (although I was transmitting exactly the same packet), and then I replaced > my printk with something like skb_shinfo(skb)->gso_type = <the gso type I had > for non-VM skb> and it all worked. > > Then I looked into what was different between the two gso_types and the > only difference was that SKB_GSO_DODGY was set when Tx'ing from the VM. > I am sure I could have been more delicate with the aproach, but hey, it > worked for me. > > I would be curious to see if this is the same issue as mine. It seems like > it is. > Oh, and if I remember correctly, gso_type without VMs involved was 129 (SKB_GSO_UDP_TUNNEL | SKB_GSO_TCPV4) and with VM it was 133 (SKB_GSO_UDP_TUNNEL | SKB_GSO_DODGY | SKB_GSO_TCPV4). > > > > >> -- > > >> To unsubscribe from this list: send the line "unsubscribe netdev" in > > >> the body of a message to majordomo@vger.kernel.org > > >> More majordomo info at http://vger.kernel.org/majordomo-info.html > > >> > > > ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-04 0:35 ` Joseph Gasparakis 2013-12-04 0:34 ` Alexei Starovoitov 2013-12-04 0:44 ` Joseph Gasparakis @ 2013-12-04 8:35 ` Or Gerlitz 2013-12-04 9:24 ` Joseph Gasparakis 2 siblings, 1 reply; 63+ messages in thread From: Or Gerlitz @ 2013-12-04 8:35 UTC (permalink / raw) To: Joseph Gasparakis Cc: Eric Dumazet, Jerry Chu, Or Gerlitz, Eric Dumazet, Alexei Starovoitov, Pravin B Shelar, David Miller, netdev On Wed, Dec 4, 2013 at 2:35 AM, Joseph Gasparakis <joseph.gasparakis@intel.com> wrote: > I was printing the gso_type in vxlan_xmit_skb(), right before > iptunnel_xmit() gets called (I was focus UDPv4 encap only). Then I saw the > gso_type was different when a VM was involved and when it was not > (although I was transmitting exactly the same packet), and then I replaced > my printk with something like skb_shinfo(skb)->gso_type = <the gso type I had > for non-VM skb> and it all worked. > > Then I looked into what was different between the two gso_types and the > only difference was that SKB_GSO_DODGY was set when Tx'ing from the VM. > I am sure I could have been more delicate with the aproach, but hey, it > worked for me. > > I would be curious to see if this is the same issue as mine. It seems like it is. nope! with the latest net tree, after handle_offloads is called in vxlan_xmit_skb and before iptunnel_xmit is invoked, skb_shinfo(skb)->gso_type is either 0 or 0x201 which is (SKB_GSO_UDP_TUNNEL | SKB_GSO_TCPV4), and it's the same value whether the session runs over a veth device or directly over the bridge, where over veth and > 1 stream we see drops, bad perf, etc. I am very interested in the VM case too, so will check it out and let you know. ^ permalink raw reply	[flat|nested] 63+ messages in thread
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-04 8:35 ` Or Gerlitz @ 2013-12-04 9:24 ` Joseph Gasparakis 2013-12-04 9:41 ` Or Gerlitz 0 siblings, 1 reply; 63+ messages in thread From: Joseph Gasparakis @ 2013-12-04 9:24 UTC (permalink / raw) To: Or Gerlitz Cc: Joseph Gasparakis, Eric Dumazet, Jerry Chu, Or Gerlitz, Eric Dumazet, Alexei Starovoitov, Pravin B Shelar, David Miller, netdev On Wed, 4 Dec 2013, Or Gerlitz wrote: > On Wed, Dec 4, 2013 at 2:35 AM, Joseph Gasparakis > <joseph.gasparakis@intel.com> wrote: > > > I was printing the gso_type in vxlan_xmit_skb(), right before > > iptunnel_xmit() gets called (I was focus UDPv4 encap only). Then I saw the > > gso_type was different when a VM was involved and when it was not > > (although I was transmitting exactly the same packet), and then I replaced > > my printk with something like skb_shinfo(skb)->gso_type = <the gso type I had > > for non-VM skb> and it all worked. > > > > Then I looked into what was different between the two gso_types and the > > only difference was that SKB_GSO_DODGY was set when Tx'ing from the VM. > > I am sure I could have been more delicate with the aproach, but hey, it > > worked for me. > > > > I would be curious to see if this is the same issue as mine. It seems like it is. > > nope! with the latest net tree, after handle_offloads is called in > vxlan_xmit_skb and before iptunnel_xmit is invoked, > skb_shinfo(skb)->gso_type is either 0 or 0x201 which is > (SKB_GSO_UDP_TUNNEL | SKB_GSO_TCPV4) and its the same value whether > the session runs over veth device or directly over the bridge, where > over veth and > 1 stream we see drops, bad perf, etc. > > I am very interested in the VM case too, so will check it out and let you know. > Ok, I was really hoping that would be the same... 
And just for the record, you are seeing (SKB_GSO_UDP_TUNNEL | SKB_GSO_TCPV4) as 0x201 while I was seeing it as 0x81 because commit 61c1db7fae "ipv6: sit: add GSO/TSO support" pushed the SKB_GSO_UDP_TUNNEL bit two positions left, and I had done my tests before it. ^ permalink raw reply	[flat|nested] 63+ messages in thread
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-04 9:24 ` Joseph Gasparakis @ 2013-12-04 9:41 ` Or Gerlitz 2013-12-04 15:20 ` Or Gerlitz [not found] ` <52A197DF.5010806@mellanox.com> 0 siblings, 2 replies; 63+ messages in thread From: Or Gerlitz @ 2013-12-04 9:41 UTC (permalink / raw) To: Joseph Gasparakis Cc: Eric Dumazet, Jerry Chu, Or Gerlitz, Eric Dumazet, Alexei Starovoitov, Pravin B Shelar, David Miller, netdev On Wed, Dec 4, 2013 at 11:24 AM, Joseph Gasparakis <joseph.gasparakis@intel.com> wrote: > On Wed, 4 Dec 2013, Or Gerlitz wrote: >> nope! with the latest net tree, after handle_offloads is called in >> vxlan_xmit_skb and before iptunnel_xmit is invoked, >> skb_shinfo(skb)->gso_type is either 0 or 0x201 which is >> (SKB_GSO_UDP_TUNNEL | SKB_GSO_TCPV4) and its the same value whether >> the session runs over veth device or directly over the bridge, where >> over veth and > 1 stream we see drops, bad perf, etc. >> I am very interested in the VM case too, so will check it out and let you know. > Ok, I was really hoping that would be the same... So when running traffic from a VM I do see the SKB_GSO_DODGY bit being set! My environment for running VMs with sane performance was screwed up a bit; I will bring it up later today and see if unsetting the bit helps. > And just for the record, > you are seeing (SKB_UDP_TUNNEL | SKB_GSO_TCPV4) as 0x201 while I was > seeing it as 0x81 because commit 61c1db7fae "ipv6: sit: add GSO/TSO > support" pushed the SKB_UDP_TUNNEL two bits left, and I had done my tests > before it. indeed. Also, on what kernel did you conduct the tests in which you managed to work around the problem by unsetting that bit? ^ permalink raw reply	[flat|nested] 63+ messages in thread
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-04 9:41 ` Or Gerlitz @ 2013-12-04 15:20 ` Or Gerlitz 0 siblings, 0 replies; 63+ messages in thread From: Or Gerlitz @ 2013-12-04 15:20 UTC (permalink / raw) To: Joseph Gasparakis Cc: Eric Dumazet, Jerry Chu, Or Gerlitz, Eric Dumazet, Alexei Starovoitov, Pravin B Shelar, David Miller, netdev On Wed, Dec 4, 2013 at 11:41 AM, Or Gerlitz <or.gerlitz@gmail.com> wrote: > On Wed, Dec 4, 2013 at 11:24 AM, Joseph Gasparakis > <joseph.gasparakis@intel.com> wrote: >> On Wed, 4 Dec 2013, Or Gerlitz wrote: > >>> nope! with the latest net tree, after handle_offloads is called in >>> vxlan_xmit_skb and before iptunnel_xmit is invoked, >>> skb_shinfo(skb)->gso_type is either 0 or 0x201 which is >>> (SKB_GSO_UDP_TUNNEL | SKB_GSO_TCPV4) and its the same value whether >>> the session runs over veth device or directly over the bridge, where >>> over veth and > 1 stream we see drops, bad perf, etc. >>> I am very interested in the VM case too, so will check it out and let you know. > >> Ok, I was really hoping that would be the same... > > So when running traffic from VM I do see SKB_GSO_DODGY bit being set! > my environment for running VMs with sane peformance was screwed a bit, I will bring it > up later today and see if unsetting the bit helps. so if the DODGY bit is left set, it simply doesn't work! It seems that guest GSO packets are just dropped. As for the performance when the bit is unset in the vxlan driver as you suggested, I need to tune the VMs a bit more; more on that later today or tomorrow ^ permalink raw reply	[flat|nested] 63+ messages in thread
[parent not found: <52A197DF.5010806@mellanox.com>]
* Re: vxlan/veth performance issues on net.git + latest kernels [not found] ` <52A197DF.5010806@mellanox.com> @ 2013-12-06 9:30 ` Or Gerlitz 2013-12-08 12:43 ` Mike Rapoport 2013-12-06 10:30 ` Joseph Gasparakis 1 sibling, 1 reply; 63+ messages in thread From: Or Gerlitz @ 2013-12-06 9:30 UTC (permalink / raw) To: Or Gerlitz Cc: Joseph Gasparakis, Pravin B Shelar, Eric Dumazet, Jerry Chu, Eric Dumazet, Alexei Starovoitov, David Miller, netdev, Kirsher, Jeffrey T, John Fastabend > On 04/12/2013 11:41, Or Gerlitz wrote: > On Wed, Dec 4, 2013 at 11:24 AM, Joseph Gasparakis > <joseph.gasparakis@intel.com> wrote: > >> And just for the record, >> you are seeing (SKB_UDP_TUNNEL | SKB_GSO_TCPV4) as 0x201 while I was >> seeing it as 0x81 because commit 61c1db7fae "ipv6: sit: add GSO/TSO >> support" pushed the SKB_UDP_TUNNEL two bits left, and I had done my tests >> before it. > > indeed, also, on what kernel did you conducted your tests which you managed > to WA the problem with unsetting that bit? Hi Joseph, Really need your response here -- 1. on which kernel did you manage to get decent vxlan performance with this hack? 2. did the hack help for both veth host traffic or only on PV VM traffic? Currently it doesn't converge with 3.12.x or net.git: with veth/vxlan the DODGY bit isn't set when looking at the skb at vxlan xmit time, so there's nothing for me to hack there. For VMs, without unsetting the bit things don't really work, but unsetting it by itself so far didn't get me far performance wise. BTW guys, I saw the issues with both bridge/openvswitch configurations - seems that we might have here somehow large breakage of the system w.r.t vxlan traffic for rates that go over a few Gbs -- so would love to get feedback of any kind from the people that were involved with vxlan over the last months/year. Or. 
net.git]# grep -rn SKB_GSO_DODGY drivers/net/ net/ipv4 net/core
drivers/net/macvtap.c:585:	skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
drivers/net/tun.c:1135:	skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
drivers/net/virtio_net.c:497:	skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
drivers/net/xen-netback/netback.c:1146:	skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
drivers/net/xen-netfront.c:823:	skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
net/ipv4/af_inet.c:1264:	SKB_GSO_DODGY |
net/ipv4/tcp_offload.c:56:	SKB_GSO_DODGY |
net/ipv4/gre_offload.c:40:	SKB_GSO_DODGY |
net/ipv4/udp_offload.c:53:	if (unlikely(type & ~(SKB_GSO_UDP | SKB_GSO_DODGY |
net/core/dev.c:2694:	if (shinfo->gso_type & SKB_GSO_DODGY)
^ permalink raw reply	[flat|nested] 63+ messages in thread
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-06 9:30 ` Or Gerlitz @ 2013-12-08 12:43 ` Mike Rapoport 2013-12-08 13:07 ` Or Gerlitz 0 siblings, 1 reply; 63+ messages in thread From: Mike Rapoport @ 2013-12-08 12:43 UTC (permalink / raw) To: Or Gerlitz Cc: Or Gerlitz, Joseph Gasparakis, Pravin B Shelar, Eric Dumazet, Jerry Chu, Eric Dumazet, Alexei Starovoitov, David Miller, netdev, Kirsher, Jeffrey T, John Fastabend On Fri, Dec 06, 2013 at 11:30:37AM +0200, Or Gerlitz wrote: > > On 04/12/2013 11:41, Or Gerlitz wrote: > > BTW guys, I saw the issues with both bridge/openvswitch configuration > - seems that we might have here somehow large breakage of the system > w.r.t vxlan traffic for rates that go over few Gbs -- so would love to > get feedback of any kind from the people that were involved with vxlan > over the last months/year. I've seen similar problems with vxlan traffic. In our scenario I had two VMs running on the same host, both VMs having the { veth --> bridge --> vxlan --> IP stack --> NIC } chain. Running iperf on veth showed a rate ~6 times slower than direct NIC <-> NIC. With a hack that forces a large gso_size in vxlan's handle_offloads, I've got veth performing only slightly slower than NICs ... The explanation I thought of is that performing the split of the packet as late as possible reduces processing overhead and allows more data to be processed. My $0.02 > > Or. > -- Sincerely yours, Mike. ^ permalink raw reply	[flat|nested] 63+ messages in thread
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-08 12:43 ` Mike Rapoport @ 2013-12-08 13:07 ` Or Gerlitz 2013-12-08 14:30 ` Mike Rapoport 0 siblings, 1 reply; 63+ messages in thread From: Or Gerlitz @ 2013-12-08 13:07 UTC (permalink / raw) To: Mike Rapoport, Or Gerlitz Cc: Joseph Gasparakis, Pravin B Shelar, Eric Dumazet, Jerry Chu, Eric Dumazet, Alexei Starovoitov, David Miller, netdev, Kirsher, Jeffrey T, John Fastabend On 08/12/2013 14:43, Mike Rapoport wrote: > On Fri, Dec 06, 2013 at 11:30:37AM +0200, Or Gerlitz wrote: >>> On 04/12/2013 11:41, Or Gerlitz wrote: >> BTW guys, I saw the issues with both bridge/openvswitch configuration >> - seems that we might have here somehow large breakage of the system >> w.r.t vxlan traffic for rates that go over few Gbs -- so would love to >> get feedback of any kind from the people that were involved with vxlan >> over the last months/year. > I've seen similar problems with vxlan traffic. In our scenario I had two VMs running on the same host and both VMs having the { veth --> bridge --> vlxan --> IP stack --> NIC } chain. How were the VMs connected to the veth NICs? What kernel were you using? > Running iperf on veth showed rate ~6 times slower than direct NIC <-> NIC. With a hack that forces large gso_size in vxlan's handle_offloads, I've got veth performing only slightly slower than NICs ... The explanation I thought of is that performing the split of the packet as late as possible reduces processing overhead and allows more data to be processed. thanks for the tip! A few quick clarifications -- so you artificially enlarged the gso_size of the skb? 
can you provide the line you added here?

static int handle_offloads(struct sk_buff *skb)
{
	if (skb_is_gso(skb)) {
		int err = skb_unclone(skb, GFP_ATOMIC);

		if (unlikely(err))
			return err;

		skb_shinfo(skb)->gso_type |= SKB_GSO_UDP_TUNNEL;
	} else if (skb->ip_summed != CHECKSUM_PARTIAL)
		skb->ip_summed = CHECKSUM_NONE;

	return 0;
}

also, why does enlarging the gso size for skb's cause the actual segmentation to come into play lower in the stack?

Or.

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-08 13:07 ` Or Gerlitz @ 2013-12-08 14:30 ` Mike Rapoport 2013-12-08 20:50 ` Eric Dumazet 0 siblings, 1 reply; 63+ messages in thread From: Mike Rapoport @ 2013-12-08 14:30 UTC (permalink / raw) To: Or Gerlitz Cc: Or Gerlitz, Joseph Gasparakis, Pravin B Shelar, Eric Dumazet, Jerry Chu, Eric Dumazet, Alexei Starovoitov, David Miller, netdev, Kirsher, Jeffrey T, John Fastabend On Sun, Dec 08, 2013 at 03:07:54PM +0200, Or Gerlitz wrote: > On 08/12/2013 14:43, Mike Rapoport wrote: > > On Fri, Dec 06, 2013 at 11:30:37AM +0200, Or Gerlitz wrote: > >>> On 04/12/2013 11:41, Or Gerlitz wrote: > >> BTW guys, I saw the issues with both bridge/openvswitch configuration > >> - seems that we might have here somehow large breakage of the system > >> w.r.t vxlan traffic for rates that go over few Gbs -- so would love to > >> get feedback of any kind from the people that were involved with vxlan > >> over the last months/year. > > I've seen similar problems with vxlan traffic. In our scenario I had two VMs running on the same host and both VMs having the { veth --> bridge --> vlxan --> IP stack --> NIC } chain. > > How the VMs were connected to the veth NICs? what kernel were you using? > > > > Running iperf on veth showed rate ~6 times slower than direct NIC <-> NIC. With a hack that forces large gso_size in vxlan's handle_offloads, I've got veth performing only slightly slower than NICs ... The explanation I thought of is that performing the split of the packet as late as possible reduces processing overhead and allows more data to be processed. > > thanks for the tip! few quick clarifications -- so you artificially > enlarged the gso_size of the skb? 
> can you provide the line you added here

It was something *very* hacky:

static int handle_offloads(struct sk_buff *skb)
{
	if (skb_is_gso(skb)) {
		int err = skb_unclone(skb, GFP_ATOMIC);

		if (unlikely(err))
			return err;

		skb_shinfo(skb)->gso_type |= SKB_GSO_UDP_TUNNEL;

		if (skb->len < 64000)
			skb_shinfo(skb)->gso_size = skb->len;
		else
			skb_shinfo(skb)->gso_size = 64000;
	} else if (skb->ip_summed != CHECKSUM_PARTIAL)
		skb->ip_summed = CHECKSUM_NONE;

	return 0;
}

> also, why enlarging the gso size for skb's cause the actual segmentation
> to come into play lower in the stack?
>
> Or.

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-08 14:30 ` Mike Rapoport @ 2013-12-08 20:50 ` Eric Dumazet 2013-12-08 21:36 ` Eric Dumazet 0 siblings, 1 reply; 63+ messages in thread From: Eric Dumazet @ 2013-12-08 20:50 UTC (permalink / raw) To: Mike Rapoport Cc: Or Gerlitz, Or Gerlitz, Joseph Gasparakis, Pravin B Shelar, Jerry Chu, Eric Dumazet, Alexei Starovoitov, David Miller, netdev, Kirsher, Jeffrey T, John Fastabend

On Sun, 2013-12-08 at 16:30 +0200, Mike Rapoport wrote:
>
> It was something *very* hacky:
>
>	if (skb->len < 64000)
>		skb_shinfo(skb)->gso_size = skb->len;
>	else
>		skb_shinfo(skb)->gso_size = 64000;

This sounds like a 16-bit overflow somewhere.

This reminds me of the issue we fixed in commit 50bceae9bd356 ("tcp: Reallocate headroom if it would overflow csum_start")

You might try to reduce the 0xFFFF to something smaller.

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 993da005e087..8364bcfe1e08 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2373,7 +2373,7 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
 	 * beyond what csum_start can cover.
 	 */
 	if (unlikely((NET_IP_ALIGN && ((unsigned long)skb->data & 3)) ||
-		     skb_headroom(skb) >= 0xFFFF)) {
+		     skb_headroom(skb) >= 0xFF00)) {
 		struct sk_buff *nskb = __pskb_copy(skb, MAX_TCP_HEADER,
 						   GFP_ATOMIC);

 		return nskb ? tcp_transmit_skb(sk, nskb, 0, GFP_ATOMIC) :

^ permalink raw reply related [flat|nested] 63+ messages in thread
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-08 20:50 ` Eric Dumazet @ 2013-12-08 21:36 ` Eric Dumazet 0 siblings, 0 replies; 63+ messages in thread From: Eric Dumazet @ 2013-12-08 21:36 UTC (permalink / raw) To: Mike Rapoport Cc: Or Gerlitz, Or Gerlitz, Joseph Gasparakis, Pravin B Shelar, Jerry Chu, Eric Dumazet, Alexei Starovoitov, David Miller, netdev, Kirsher, Jeffrey T, John Fastabend

On Sun, 2013-12-08 at 12:50 -0800, Eric Dumazet wrote:
> On Sun, 2013-12-08 at 16:30 +0200, Mike Rapoport wrote:
> >
> > It was something *very* hacky:
> >
> >	if (skb->len < 64000)
> >		skb_shinfo(skb)->gso_size = skb->len;
> >	else
> >		skb_shinfo(skb)->gso_size = 64000;
>
> This sounds like a 16-bit overflow somewhere.
>
> This reminds me of the issue we fixed in commit 50bceae9bd356
> ("tcp: Reallocate headroom if it would overflow csum_start")
>
> You might try to reduce the 0xFFFF to something smaller.

Also try the following debugging patch:

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 2718fed53d8c..d6fcb6272d37 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -913,8 +913,12 @@ EXPORT_SYMBOL(skb_clone);
 static void skb_headers_offset_update(struct sk_buff *skb, int off)
 {
 	/* Only adjust this if it actually is csum_start rather than csum */
-	if (skb->ip_summed == CHECKSUM_PARTIAL)
-		skb->csum_start += off;
+	if (skb->ip_summed == CHECKSUM_PARTIAL) {
+		u32 val = (u32)skb->csum_start + off;
+
+		WARN_ON_ONCE(val > 0xFFFF);
+		skb->csum_start = val;
+	}
 	/* {transport,network,mac}_header and tail are relative to skb->head */
 	skb->transport_header += off;
 	skb->network_header += off;

^ permalink raw reply related [flat|nested] 63+ messages in thread
* Re: vxlan/veth performance issues on net.git + latest kernels [not found] ` <52A197DF.5010806@mellanox.com> 2013-12-06 9:30 ` Or Gerlitz @ 2013-12-06 10:30 ` Joseph Gasparakis 2013-12-07 21:27 ` Or Gerlitz 2013-12-08 15:21 ` Or Gerlitz 1 sibling, 2 replies; 63+ messages in thread From: Joseph Gasparakis @ 2013-12-06 10:30 UTC (permalink / raw) To: Or Gerlitz Cc: Joseph Gasparakis, Pravin B Shelar, Or Gerlitz, Eric Dumazet, Jerry Chu, Eric Dumazet, Alexei Starovoitov, David Miller, netdev, jeffrey.t.kirsher, John Fastabend

On Fri, 6 Dec 2013, Or Gerlitz wrote:

> On 04/12/2013 11:41, Or Gerlitz wrote:
> > On Wed, Dec 4, 2013 at 11:24 AM, Joseph Gasparakis <joseph.gasparakis@intel.com> wrote:
> > > And just for the record, you are seeing (SKB_UDP_TUNNEL | SKB_GSO_TCPV4) as 0x201 while I was
> > > seeing it as 0x81 because commit 61c1db7fae "ipv6: sit: add GSO/TSO support" pushed the
> > > SKB_UDP_TUNNEL two bits left, and I had done my tests before it.
> > indeed, also, on what kernel did you conduct your tests in which you managed
> > to WA the problem with unsetting that bit?
>
> Hi Joseph,
>
> Really need your response here --

I'm sorry Or, I managed to miss your original request...

> 1. on which kernel did you manage to get along fine vxlan performance wise
> with this hack?

I was running 3.10.6.

> 2. did the hack helped for both veth host traffic or only on PV VM traffic?

No, just VM. I haven't tried veth.

If you leave the DODGY bit, does your traffic get dropped on Tx, after it leaves vxlan and before it hits your driver, which is what I had seen. Is that right? If you unset it, do you recover?

What is the output of your ethtool -k on the interface you are transmitting from?

> Currently it doesn't converge with 3.12.x or net.git, with veth/vxlan the
> DODGY bit isn't set when looking on the skb at vxlan xmit time, so there's
> nothing for me to hack there.
> For VMs without unsetting the bit things don't
> really work, but unsetting it for itself so far didn't get me far performance
> wise.
>
> BTW guys, I saw the issues with both bridge/openvswitch configuration - seems
> that we might have here somehow large breakage of the system w.r.t vxlan
> traffic for rates that go over few Gbs -- so would love to get feedback of any
> kind from the people that were involved with vxlan over the last months/year.
>
> Or.
>
> net.git]# grep -rn SKB_GSO_DODGY drivers/net/ net/ipv4 net/core
> drivers/net/macvtap.c:585: skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
> drivers/net/tun.c:1135: skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
> drivers/net/virtio_net.c:497: skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
> drivers/net/xen-netback/netback.c:1146: skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
> drivers/net/xen-netfront.c:823: skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
> net/ipv4/af_inet.c:1264: SKB_GSO_DODGY |
> net/ipv4/tcp_offload.c:56: SKB_GSO_DODGY |
> net/ipv4/gre_offload.c:40: SKB_GSO_DODGY |
> net/ipv4/udp_offload.c:53: if (unlikely(type & ~(SKB_GSO_UDP | SKB_GSO_DODGY |
> net/core/dev.c:2694: if (shinfo->gso_type & SKB_GSO_DODGY)

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-06 10:30 ` Joseph Gasparakis @ 2013-12-07 21:27 ` Or Gerlitz 2013-12-08 18:08 ` Joseph Gasparakis 0 siblings, 1 reply; 63+ messages in thread From: Or Gerlitz @ 2013-12-07 21:27 UTC (permalink / raw) To: Joseph Gasparakis Cc: Or Gerlitz, Pravin B Shelar, Eric Dumazet, Jerry Chu, Eric Dumazet, Alexei Starovoitov, David Miller, netdev, Kirsher, Jeffrey T, John Fastabend

On Fri, Dec 6, 2013, Joseph Gasparakis <joseph.gasparakis@intel.com> wrote:
> On Fri, 6 Dec 2013, Or Gerlitz wrote:
>> On 04/12/2013 11:41, Or Gerlitz wrote:
>>> On Wed, Dec 4, 2013 at 11:24 AM, Joseph Gasparakis <joseph.gasparakis@intel.com> wrote:
>>>> And just for the record, you are seeing (SKB_UDP_TUNNEL | SKB_GSO_TCPV4) as 0x201 while I was
>>>> seeing it as 0x81 because commit 61c1db7fae "ipv6: sit: add GSO/TSO support" pushed the
>>>> SKB_UDP_TUNNEL two bits left, and I had done my tests before it.
>>> indeed, also, on what kernel did you conduct your tests in which you managed
>>> to WA the problem with unsetting that bit?
>>
>> Hi Joseph,
>>
>> Really need your response here --
>
> I'm sorry Or, I managed to miss your original request...

sure.. it happens.

>> 1. on which kernel did you manage to get along fine vxlan performance wise
>> with this hack?
>
> I was running 3.10.6.

I see, will try it out, and just for getting closer to your env, what kernel were the guests running? was a bridge / ovs instance involved in the VM PV connectivity?

>> 2. did the hack helped for both veth host traffic or only on PV VM traffic?
>
> No, just VM. I haven't tried veth.

I see, earlier I was somehow under the impression you noted the problem for veth too.

> If you leave the DODGY bit, does your traffic get dropped on Tx, after it
> leaves vxlan and before it hits your driver, which is what I had seen. Is
> that right?
What I saw is that if I leave the DODGY bit set, practically things don't work at all; it's not that some packets are dropped. Was that what you saw? Or in your env were only **some** or **few** packets dropped each time, but this killed the tcp session performance?

Also, did you hack/modify the VM NIC MTU to take the encapsulation overhead into account?

> If you unset it, do you recover?

let me redo this with your setting and see, please make sure to tell me what kernel the VM was running too (thanks!)

> What is the output of your ethtool -k on the interface you are
> transmitting from?

will send you tomorrow, but this happens without offloads for encapsulated traffic.

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-07 21:27 ` Or Gerlitz @ 2013-12-08 18:08 ` Joseph Gasparakis 2013-12-08 20:12 ` Or Gerlitz 0 siblings, 1 reply; 63+ messages in thread From: Joseph Gasparakis @ 2013-12-08 18:08 UTC (permalink / raw) To: Or Gerlitz Cc: Joseph Gasparakis, Or Gerlitz, Pravin B Shelar, Eric Dumazet, Jerry Chu, Eric Dumazet, Alexei Starovoitov, David Miller, netdev, Kirsher, Jeffrey T, John Fastabend On Sat, 7 Dec 2013, Or Gerlitz wrote: > On Fri, Dec 6, 2013, Joseph Gasparakis <joseph.gasparakis@intel.com> wrote: > > On Fri, 6 Dec 2013, Or Gerlitz wrote: > > > >> On 04/12/2013 11:41, Or Gerlitz wrote: > >> > On Wed, Dec 4, 2013 at 11:24 AM, Joseph > >> > Gasparakis<joseph.gasparakis@intel.com> wrote: > >> > > >And just for the record, > >> > > >you are seeing (SKB_UDP_TUNNEL | SKB_GSO_TCPV4) as 0x201 while I was > >> > > >seeing it as 0x81 because commit 61c1db7fae "ipv6: sit: add GSO/TSO > >> > > >support" pushed the SKB_UDP_TUNNEL two bits left, and I had done my tests > >> > > before it. > >> > indeed, also, on what kernel did you conducted your tests which you managed > >> > to WA the problem with unsetting that bit? > >> > >> > >> Hi Joseph, > >> > >> Really need your response here -- > > > > I'm sorry Or, I managed to miss your original request... > > sure.. it happens. > > >> 1. on which kernel did you manage to get along fine vxlan performance wise > >> with this hack? > > > I was running 3.10.6. > > I see, will try it out, and just for getting closer to your env, what > kernel where the guests running? was a bridge / ovs instance involved > in the VM PV connectivity? > The VMs were running an old 2.6.32 kernel, although if I remember well I also tried a 3.x - sorry can't remember which one. The VMs were attached to a bridge, but I haven't noticed any packet loss there. > > >> 2. did the hack helped for both veth host traffic or only on PV VM traffic? > > > No, just VM. I haven't tried veth. 
> I see, earlier I was somehow under the impression you noted the
> problem for veth too.
>
> > If you leave the DODGY bit, does your traffic get dropped on Tx, after it
> > leaves vxlan and before it hits your driver, which is what I had seen. Is
> > that right?
>
> What I saw is that if I leave the DODGY bit set, practically things
> don't work at all, its not that some packets are dropped, was that
> what you saw?

What I saw was gso packets badly segmented, causing many re-transmissions and dropping the performance to a few MB/s.

> Or on your env only **some** or **few** packets were dropped each time
> but this killed the tcp session performance?
>
> Also, did you hack/modified the VM NIC MTU to take into the account
> the encapsulation overhead?

The virtio interfaces I used had MTU 1500, but the MTU of the physical NIC was increased to 1600.

> > If you unset it, do you recover?
>
> let me redo this with your setting and see, please make sure to tell
> me what kernel the VM was running too (thanks!)
>
> > What is the output of your ethtool -k on the interface you are
> > transmitting from?
>
> will send you tomorrow, but this happens without offloads for
> encapsulated traffic.

I have only noticed this with the offloads on. Turning encapsulation TSO off would simply make the gso's get segmented in dev_hard_start_xmit() as expected.

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-08 18:08 ` Joseph Gasparakis @ 2013-12-08 20:12 ` Or Gerlitz 0 siblings, 0 replies; 63+ messages in thread From: Or Gerlitz @ 2013-12-08 20:12 UTC (permalink / raw) To: Joseph Gasparakis Cc: Or Gerlitz, Pravin B Shelar, Eric Dumazet, Jerry Chu, Eric Dumazet, Alexei Starovoitov, David Miller, netdev, Kirsher, Jeffrey T, John Fastabend

On Sun, Joseph Gasparakis <joseph.gasparakis@intel.com> wrote:
>> What I saw is that if I leave the DODGY bit set, practically things
>> don't work at all, its not that some packets are dropped, was that
>> what you saw?
> What I saw was gso packets badly segmented, causing many re-transmissions
> and dropping the performance to a few MB/s.

Yes, in my testbed up to about 400Mbs (b not B..., yes!)

>> Also, did you hack/modified the VM NIC MTU to take into the account
>> the encapsulation overhead?
> The virtio interfaces I used had MTU 1500, but the MTU of the physical NIC
> was increased to 1600.

mmm, that's sort of equivalent, but zero touch VM wise, nice!

> I have only noticed this with the offloads on. Turning encapsulation
> TSO off would simply make the gso's get segmented in dev_hard_start_xmit()
> as expected.

mmm, I am not sure this is the case with kernels > 3.10.x, but I'd like to double check that; basically, it's possible that I didn't make sure to always have a "proper" MTU at the VM at all times.

Also, did you see the asymmetry between TX/RX which I reported earlier today, that is, accelerated TX from a single VM can go as far as > 30Gbs while RX to a single VM or even multiple VMs doesn't go beyond 5-6Gbs, probably due to the lack of GRO?

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-06 10:30 ` Joseph Gasparakis 2013-12-07 21:27 ` Or Gerlitz @ 2013-12-08 15:21 ` Or Gerlitz 1 sibling, 0 replies; 63+ messages in thread From: Or Gerlitz @ 2013-12-08 15:21 UTC (permalink / raw) To: Joseph Gasparakis Cc: Pravin B Shelar, Or Gerlitz, Eric Dumazet, Jerry Chu, Eric Dumazet, Alexei Starovoitov, David Miller, netdev, jeffrey.t.kirsher, John Fastabend, Jerry Chu On 06/12/2013 12:30, Joseph Gasparakis wrote: > On Fri, 6 Dec 2013, Or Gerlitz wrote: > > >> 1. on which kernel did you manage to get along fine vxlan performance wise >> with this hack? >> > I was running 3.10.6. > >> 2. did the hack helped for both veth host traffic or only on PV VM traffic? >> > No, just VM. I haven't tried veth. > > If you leave the DODGY bit, does your traffic get droped on Tx, after it > leaves vxlan and before it hits your driver, which is what I had seen. Is > that right? > > If you unset it, do you recover? > > What is the output of your ethtool -k on the interface you are > transmitting from? > >> Currently it doesn't converge with 3.12.x or net.git, with veth/vxlan the >> DODGE bit isn't set when looking on the skb in the vxlan xmit time, so there's >> nothing for me to hack there. For VMs without unsetting the bit things don't >> really work, but unsetting it for itself so far didn't get me far performance >> wise. >> >> BTW guys, I saw the issues with both bridge/openvswitch configuration - seems >> that we might have here somehow large breakage of the system w.r.t vxlan >> traffic for rates that go over few Gbs -- so would love to get feedback of any >> kind from the people that were involved with vxlan over the last months/year. >> >> OK!! so finally I managed to get some hacked but stable ground to step on .... indeed with 3.10.X (I tried 3.10.19) if you 1. reduce the VM PV NIC MTU to account for the vxlan tunneling overhead (e.g to 1450 vs 1500) 2. 
unset the DODGY bit for GSO packets in the vxlan driver handle_offloads function

--> You get sane vxlan performance when the VM xmits. Without HW offloads I got up to 4-5 Gbs for a single VM, and with HW offloads > 30Gbs for a single VM when the VM is sending to a peer hypervisor. On the VM RX side it doesn't go much higher, e.g. it stays in the order of 3-4Gbs for a single receiving VM; I am pretty sure this relates to the lack of GRO for vxlan, which is pretty much terrible for VM traffic.

So it seems the TODO here is the following:

1. manage to get the hack for vm-vxlan traffic to work on the net tree
2. fix the bug that makes the hack necessary
3. find the problem with veth-vxlan traffic on the net tree
4. add GRO support for encapsulated/vxlan traffic

Or.

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-03 15:05 vxlan/veth performance issues on net.git + latest kernels Or Gerlitz 2013-12-03 15:30 ` Eric Dumazet @ 2013-12-03 17:12 ` Eric Dumazet 2013-12-03 19:50 ` Or Gerlitz 1 sibling, 1 reply; 63+ messages in thread From: Eric Dumazet @ 2013-12-03 17:12 UTC (permalink / raw) To: Or Gerlitz Cc: Eric Dumazet, Alexei Starovoitov, Pravin B Shelar, David Miller, netdev

On Tue, 2013-12-03 at 17:05 +0200, Or Gerlitz wrote:
> ----------------------
>
> ip link add vxlan42 type vxlan id 42 group 239.0.0.42 ttl 10 dev ethN
> ifconfig vxlan42 192.168.42.144/24 up
>
> brctl addbr br-vx
> ip link set br-vx up
>
> ifconfig br-vx 192.168.52.144/24 up
> brctl addif br-vx vxlan42
>
> ip link add type veth
> brctl addif br-vx veth1

Aside question : If I do not have brctl on my host, how can I use "ip" command to perform this "brctl addif" action ?

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-03 17:12 ` Eric Dumazet @ 2013-12-03 19:50 ` Or Gerlitz 2013-12-03 20:19 ` John Fastabend 2013-12-03 21:12 ` Eric Dumazet 0 siblings, 2 replies; 63+ messages in thread From: Or Gerlitz @ 2013-12-03 19:50 UTC (permalink / raw) To: Eric Dumazet; +Cc: Or Gerlitz, netdev

On Tue, Dec 3, 2013 at 7:12 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:

>> brctl addbr br-vx
>> ip link set br-vx up
>> ifconfig br-vx 192.168.52.144/24 up
>> brctl addif br-vx vxlan42
>> ip link add type veth
>> brctl addif br-vx veth1

> Aside question : If I do not have brctl on my host,
> how can I use "ip" command to perform this "brctl addif" action ?

you can get the sources from git://git.kernel.org/pub/scm/linux/kernel/git/shemminger/bridge-utils.git or from http://sourceforge.net/projects/bridge/files/bridge/bridge-utils-1.5.tar.gz/download, should be trivial to configure and build, if I remember correctly

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-03 19:50 ` Or Gerlitz @ 2013-12-03 20:19 ` John Fastabend 2013-12-03 21:12 ` Eric Dumazet 1 sibling, 0 replies; 63+ messages in thread From: John Fastabend @ 2013-12-03 20:19 UTC (permalink / raw) To: Or Gerlitz; +Cc: Eric Dumazet, Or Gerlitz, netdev

On 12/3/2013 11:50 AM, Or Gerlitz wrote:
> On Tue, Dec 3, 2013 at 7:12 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>>> brctl addbr br-vx
>>> ip link set br-vx up
>>> ifconfig br-vx 192.168.52.144/24 up
>>> brctl addif br-vx vxlan42
>>> ip link add type veth
>>> brctl addif br-vx veth1
>
>> Aside question : If I do not have brctl on my host,
>> how can I use "ip" command to perform this "brctl addif" action ?
>
> you can get the sources from
> git://git.kernel.org/pub/scm/linux/kernel/git/shemminger/bridge-utils.git
> or from http://sourceforge.net/projects/bridge/files/bridge/bridge-utils-1.5.tar.gz/download,
> should be trivial to configure and build, if I remember correct

The 'brctl addif' should be the same as setting the master:

#ip link set dev veth1 master br-vx

and then use nomaster to delete the interface.

.John

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: vxlan/veth performance issues on net.git + latest kernels 2013-12-03 19:50 ` Or Gerlitz 2013-12-03 20:19 ` John Fastabend @ 2013-12-03 21:12 ` Eric Dumazet 1 sibling, 0 replies; 63+ messages in thread From: Eric Dumazet @ 2013-12-03 21:12 UTC (permalink / raw) To: Or Gerlitz; +Cc: Or Gerlitz, netdev

On Tue, 2013-12-03 at 21:50 +0200, Or Gerlitz wrote:
> you can get the sources from
> git://git.kernel.org/pub/scm/linux/kernel/git/shemminger/bridge-utils.git
> or from http://sourceforge.net/projects/bridge/files/bridge/bridge-utils-1.5.tar.gz/download,
> should be trivial to configure and build, if I remember correct

Well, I did not elaborate. Let's say I have some hosts where I cannot install a new binary, for whatever reason ;)

Thanks, John Fastabend replied to the question.

^ permalink raw reply [flat|nested] 63+ messages in thread