* Slow speed of tcp connections in a network namespace
@ 2012-12-29  9:24 Andrew Vagin
  2012-12-29 13:53 ` Eric Dumazet
  0 siblings, 1 reply; 16+ messages in thread

From: Andrew Vagin @ 2012-12-29  9:24 UTC (permalink / raw)
  To: netdev; +Cc: vvs

[-- Attachment #1: Type: text/plain, Size: 1419 bytes --]

We found a few nodes where the network is slow in containers.

To test the speed of TCP connections we use wget, which downloads ISO
images from the internet.

wget in the new netns reports only 1.5 MB/s, but wget in the root netns
reports 33 MB/s.

A few facts:
* Experiments show that the window size for CT traffic does not increase
  beyond ~900, while for host traffic the window size increases up to ~14000
* packets are sometimes reordered in the netns.
* changing tso/gro/gso on the interfaces does not help
* the issue was _NOT_ reproduced if the kernel was booted with maxcpus=1
  or bnx2.disable_msi=1

I reduced the steps to reproduce:
* Create a new network namespace "test" and a veth pair.
# ip netns add test
# ip link add name veth0 type veth peer name veth1

* Move veth1 into the netns test
# ip link set veth1 netns test

* Set an IP address on veth1 and add proper routing rules for this IP
  in the root netns.
# ip link set up dev veth0
# ip netns exec test ip link set up dev veth1
# ip netns exec test ip a add REMOTE dev veth1
# ip netns exec test ip r a default dev veth1
# ip r a REMOTE/32 dev veth0

Tcpdumps for both cases are attached to this message.
tcpdump.host - wget in the root netns
tcpdump.netns.host - tcpdump for the host device, wget in the new netns
tcpdump.netns.veth - tcpdump for the veth1 device, wget in the new netns

3.8-rc1 is used for the experiments.

Do you have any ideas where the problem is?

[-- Attachment #2: tcpdump.host.gz --]
[-- Type: application/x-gzip, Size: 168126 bytes --]

[-- Attachment #3: tcpdump.netns.veth.gz --]
[-- Type: application/x-gzip, Size: 178809 bytes --]

[-- Attachment #4: tcpdump.netns.host.gz --]
[-- Type: application/x-gzip, Size: 178424 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread
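For reference, the measurement itself is just two wget runs, one per
namespace — a sketch of the test method described above (the mirror URL is
hypothetical; any large file works):

  # wget -O /dev/null http://mirror.example.com/test.iso
  # ip netns exec test wget -O /dev/null http://mirror.example.com/test.iso

wget prints the average throughput at the end of each run, which is where
the 33 MB/s vs 1.5 MB/s numbers come from.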
* Re: Slow speed of tcp connections in a network namespace
  2012-12-29  9:24 Slow speed of tcp connections in a network namespace Andrew Vagin
@ 2012-12-29 13:53 ` Eric Dumazet
  2012-12-29 14:50   ` Andrew Vagin
  2012-12-29 16:01   ` Michał Mirosław
  0 siblings, 2 replies; 16+ messages in thread

From: Eric Dumazet @ 2012-12-29 13:53 UTC (permalink / raw)
  To: Andrew Vagin; +Cc: netdev, vvs, Michał Mirosław

On Sat, 2012-12-29 at 13:24 +0400, Andrew Vagin wrote:
> We found a few nodes where the network is slow in containers.
>
> To test the speed of TCP connections we use wget, which downloads ISO
> images from the internet.
>
> wget in the new netns reports only 1.5 MB/s, but wget in the root netns
> reports 33 MB/s.
>
> A few facts:
> * Experiments show that the window size for CT traffic does not increase
>   beyond ~900, while for host traffic the window size increases up to ~14000
> * packets are sometimes reordered in the netns.
> * changing tso/gro/gso on the interfaces does not help
> * the issue was _NOT_ reproduced if the kernel was booted with maxcpus=1
>   or bnx2.disable_msi=1
>
> I reduced the steps to reproduce:
> * Create a new network namespace "test" and a veth pair.
> # ip netns add test
> # ip link add name veth0 type veth peer name veth1
>
> * Move veth1 into the netns test
> # ip link set veth1 netns test
>
> * Set an IP address on veth1 and add proper routing rules for this IP
>   in the root netns.
> # ip link set up dev veth0
> # ip netns exec test ip link set up dev veth1
> # ip netns exec test ip a add REMOTE dev veth1
> # ip netns exec test ip r a default dev veth1
> # ip r a REMOTE/32 dev veth0
>
> Tcpdumps for both cases are attached to this message.
> tcpdump.host - wget in the root netns
> tcpdump.netns.host - tcpdump for the host device, wget in the new netns
> tcpdump.netns.veth - tcpdump for the veth1 device, wget in the new netns
>
> 3.8-rc1 is used for the experiments.
>
> Do you have any ideas where the problem is?

veth has absolutely no offload features.

It needs some care...

At the very minimum, let TCP coalescing do its job by allowing SG.

CC Michał Mirosław <mirq-linux@rere.qmqm.pl> for insights.

Please try the following patch:

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 95814d9..9fefeb3 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -259,6 +259,10 @@ static const struct net_device_ops veth_netdev_ops = {
 	.ndo_set_mac_address = eth_mac_addr,
 };
 
+#define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_TSO | \
+		       NETIF_F_HW_CSUM | NETIF_F_HIGHDMA | \
+		       NETIF_F_HW_VLAN_TX | NETIF_F_HW_VLAN_RX)
+
 static void veth_setup(struct net_device *dev)
 {
 	ether_setup(dev);
@@ -269,9 +273,10 @@ static void veth_setup(struct net_device *dev)
 	dev->netdev_ops = &veth_netdev_ops;
 	dev->ethtool_ops = &veth_ethtool_ops;
 	dev->features |= NETIF_F_LLTX;
+	dev->features |= VETH_FEATURES;
 	dev->destructor = veth_dev_free;
 
-	dev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG | NETIF_F_RXCSUM;
+	dev->hw_features = VETH_FEATURES;
 }
 
 /*

^ permalink raw reply related	[flat|nested] 16+ messages in thread
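A quick way to verify which of the features from a patch like the one
above actually took effect is ethtool — a sketch, assuming a standard
ethtool build (the exact output line names can vary by version):

  # ethtool -k veth0 | grep -E 'scatter-gather|tcp-segmentation-offload|tx-checksumming'
  tx-checksumming: on
  scatter-gather: on
  tcp-segmentation-offload: on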
* Re: Slow speed of tcp connections in a network namespace
  2012-12-29 13:53 ` Eric Dumazet
@ 2012-12-29 14:50   ` Andrew Vagin
  2012-12-29 17:40     ` Eric Dumazet
  2012-12-29 16:01   ` Michał Mirosław
  1 sibling, 1 reply; 16+ messages in thread

From: Andrew Vagin @ 2012-12-29 14:50 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, vvs, Michał Mirosław

On Sat, Dec 29, 2012 at 05:53:23AM -0800, Eric Dumazet wrote:
> > 3.8-rc1 is used for the experiments.
> >
> > Do you have any ideas where the problem is?
>
> veth has absolutely no offload features.
>
> It needs some care...
>
> At the very minimum, let TCP coalescing do its job by allowing SG.
>
> CC Michał Mirosław <mirq-linux@rere.qmqm.pl> for insights.
>
> Please try the following patch:

Hello Eric,

Thanks for your feedback.

With this patch the result is a bit better (~4 MB/s), but it's still much
less than in the root netns.

> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 95814d9..9fefeb3 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -259,6 +259,10 @@ static const struct net_device_ops veth_netdev_ops = {
>  	.ndo_set_mac_address = eth_mac_addr,
>  };
>  
> +#define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_TSO | \
> +		       NETIF_F_HW_CSUM | NETIF_F_HIGHDMA | \
> +		       NETIF_F_HW_VLAN_TX | NETIF_F_HW_VLAN_RX)
> +
>  static void veth_setup(struct net_device *dev)
>  {
>  	ether_setup(dev);
> @@ -269,9 +273,10 @@ static void veth_setup(struct net_device *dev)
>  	dev->netdev_ops = &veth_netdev_ops;
>  	dev->ethtool_ops = &veth_ethtool_ops;
>  	dev->features |= NETIF_F_LLTX;
> +	dev->features |= VETH_FEATURES;
>  	dev->destructor = veth_dev_free;
>  
> -	dev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG | NETIF_F_RXCSUM;
> +	dev->hw_features = VETH_FEATURES;
>  }
>  
>  /*

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: Slow speed of tcp connections in a network namespace
  2012-12-29 14:50   ` Andrew Vagin
@ 2012-12-29 17:40     ` Eric Dumazet
  2012-12-29 18:29       ` Andrew Vagin
  2012-12-29 18:58       ` Eric Dumazet
  0 siblings, 2 replies; 16+ messages in thread

From: Eric Dumazet @ 2012-12-29 17:40 UTC (permalink / raw)
  To: Andrew Vagin; +Cc: netdev, vvs, Michał Mirosław

On Sat, 2012-12-29 at 18:50 +0400, Andrew Vagin wrote:
> On Sat, Dec 29, 2012 at 05:53:23AM -0800, Eric Dumazet wrote:
> > > 3.8-rc1 is used for the experiments.
> > >
> > > Do you have any ideas where the problem is?
> >
> > veth has absolutely no offload features.
> >
> > It needs some care...
> >
> > At the very minimum, let TCP coalescing do its job by allowing SG.
> >
> > CC Michał Mirosław <mirq-linux@rere.qmqm.pl> for insights.
> >
> > Please try the following patch:
>
> Hello Eric,
>
> Thanks for your feedback.
>
> With this patch the result is a bit better (~4 MB/s), but it's still much
> less than in the root netns.

Please post your new tcpdump then ;)

Also post "netstat -s" from the root and test ns after your wgets.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: Slow speed of tcp connections in a network namespace
  2012-12-29 17:40     ` Eric Dumazet
@ 2012-12-29 18:29       ` Andrew Vagin
  0 siblings, 0 replies; 16+ messages in thread

From: Andrew Vagin @ 2012-12-29 18:29 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, vvs, Michał Mirosław

[-- Attachment #1: Type: text/plain, Size: 1014 bytes --]

On Sat, Dec 29, 2012 at 09:40:28AM -0800, Eric Dumazet wrote:
> On Sat, 2012-12-29 at 18:50 +0400, Andrew Vagin wrote:
> > On Sat, Dec 29, 2012 at 05:53:23AM -0800, Eric Dumazet wrote:
> > > > 3.8-rc1 is used for the experiments.
> > > >
> > > > Do you have any ideas where the problem is?
> > >
> > > veth has absolutely no offload features.
> > >
> > > It needs some care...
> > >
> > > At the very minimum, let TCP coalescing do its job by allowing SG.
> > >
> > > CC Michał Mirosław <mirq-linux@rere.qmqm.pl> for insights.
> > >
> > > Please try the following patch:
> >
> > Hello Eric,
> >
> > Thanks for your feedback.
> >
> > With this patch the result is a bit better (~4 MB/s), but it's still much
> > less than in the root netns.
>
> Please post your new tcpdump then ;)

I have rebooted the host, and the speed in the netns is again about
1.7 MB/s. I don't know why it was 4 MB/s the previous time.

The new tcpdumps and netstat output are attached.

>
> Also post "netstat -s" from the root and test ns after your wgets.

[-- Attachment #2: tcpdump.host.gz --]
[-- Type: application/x-gzip, Size: 165716 bytes --]

[-- Attachment #3: tcpdump.netns.host.gz --]
[-- Type: application/x-gzip, Size: 180703 bytes --]

[-- Attachment #4: tcpdump.netns.veth.gz --]
[-- Type: application/x-gzip, Size: 181615 bytes --]

[-- Attachment #5: netstat.host --]
[-- Type: text/plain, Size: 1821 bytes --]

Ip:
    277536 total packets received
    20 forwarded
    0 incoming packets discarded
    202326 incoming packets delivered
    108228 requests sent out
    30 dropped because of missing route
Icmp:
    10 ICMP messages received
    0 input ICMP message failed.
    ICMP input histogram:
        destination unreachable: 5
        echo requests: 5
    6 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        destination unreachable: 1
        echo replies: 5
IcmpMsg:
        InType3: 5
        InType8: 5
        OutType0: 5
        OutType3: 1
Tcp:
    1491 active connections openings
    12 passive connection openings
    14 failed connection attempts
    72 connection resets received
    2 connections established
    201920 segments received
    107815 segments send out
    0 segments retransmited
    0 bad segments received.
    1338 resets sent
Udp:
    387 packets received
    0 packets to unknown port received.
    0 packet receive errors
    389 packets sent
UdpLite:
TcpExt:
    3 invalid SYN cookies received
    4 TCP sockets finished time wait in fast timer
    63 delayed acks sent
    9 delayed acks further delayed because of locked socket
    236 packets directly queued to recvmsg prequeue.
    38600456 packets directly received from backlog
    298101 packets directly received from prequeue
    163501 packets header predicted
    27103 packets header predicted and directly queued to user
    2578 acknowledgments not containing data received
    1018 predicted acknowledgments
    15 connections reset due to unexpected data
    72 connections reset due to early user close
    TCPRcvCoalesce: 123
    TCPOFOQueue: 1187
IpExt:
    InBcastPkts: 9
    OutBcastPkts: 1
    InOctets: 296395504
    OutOctets: 6965311
    InBcastOctets: 2789
    OutBcastOctets: 165

[-- Attachment #6: netstat.netns --]
[-- Type: text/plain, Size: 1463 bytes --]

Ip:
    25483 total packets received
    0 forwarded
    0 incoming packets discarded
    25483 incoming packets delivered
    14572 requests sent out
Icmp:
    4 ICMP messages received
    0 input ICMP message failed.
    ICMP input histogram:
        echo requests: 2
        echo replies: 2
    4 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        echo request: 2
        echo replies: 2
IcmpMsg:
        InType0: 2
        InType8: 2
        OutType0: 2
        OutType8: 2
Tcp:
    1 active connections openings
    0 passive connection openings
    0 failed connection attempts
    0 connection resets received
    0 connections established
    25473 segments received
    14562 segments send out
    0 segments retransmited
    0 bad segments received.
    77 resets sent
Udp:
    6 packets received
    0 packets to unknown port received.
    0 packet receive errors
    6 packets sent
UdpLite:
TcpExt:
    38 delayed acks sent
    Quick ack mode was activated 2 times
    52 packets directly queued to recvmsg prequeue.
    4916752 packets directly received from backlog
    11584 packets directly received from prequeue
    12538 packets header predicted
    2649 packets header predicted and directly queued to user
    1 acknowledgments not containing data received
    2 DSACKs sent for old packets
    1 connections reset due to unexpected data
    TCPOFOQueue: 1580
IpExt:
    InOctets: 38201843
    OutOctets: 829966

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: Slow speed of tcp connections in a network namespace
  2012-12-29 17:40     ` Eric Dumazet
  2012-12-29 18:29       ` Andrew Vagin
@ 2012-12-29 18:58       ` Eric Dumazet
  2012-12-29 19:41         ` Eric Dumazet
  2012-12-29 21:15         ` Andrew Vagin
  1 sibling, 2 replies; 16+ messages in thread

From: Eric Dumazet @ 2012-12-29 18:58 UTC (permalink / raw)
  To: Andrew Vagin; +Cc: netdev, vvs, Michał Mirosław

On Sat, 2012-12-29 at 09:40 -0800, Eric Dumazet wrote:
>
> Please post your new tcpdump then ;)
>
> Also post "netstat -s" from the root and test ns after your wgets.

Also try the following bnx2 patch.

It should help GRO / TCP coalescing.

bnx2 should be the last driver not using skb head_frag.

diff --git a/drivers/net/ethernet/broadcom/bnx2.c b/drivers/net/ethernet/broadcom/bnx2.c
index a1adfaf..08a2d40 100644
--- a/drivers/net/ethernet/broadcom/bnx2.c
+++ b/drivers/net/ethernet/broadcom/bnx2.c
@@ -2726,6 +2726,14 @@ bnx2_free_rx_page(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index)
 	rx_pg->page = NULL;
 }
 
+static void bnx2_frag_free(const struct bnx2 *bp, void *data)
+{
+	if (bp->rx_frag_size)
+		put_page(virt_to_head_page(data));
+	else
+		kfree(data);
+}
+
 static inline int
 bnx2_alloc_rx_data(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index, gfp_t gfp)
 {
@@ -2735,7 +2743,10 @@ bnx2_alloc_rx_data(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index, gfp_t gfp)
 	struct bnx2_rx_bd *rxbd =
 		&rxr->rx_desc_ring[BNX2_RX_RING(index)][BNX2_RX_IDX(index)];
 
-	data = kmalloc(bp->rx_buf_size, gfp);
+	if (bp->rx_frag_size)
+		data = netdev_alloc_frag(bp->rx_frag_size);
+	else
+		data = kmalloc(bp->rx_buf_size, gfp);
 	if (!data)
 		return -ENOMEM;
 
@@ -2744,7 +2755,7 @@ bnx2_alloc_rx_data(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index, gfp_t gfp)
 			bp->rx_buf_use_size,
 			PCI_DMA_FROMDEVICE);
 	if (dma_mapping_error(&bp->pdev->dev, mapping)) {
-		kfree(data);
+		bnx2_frag_free(bp, data);
 		return -EIO;
 	}
 
@@ -3014,9 +3025,9 @@ error:
 
 			dma_unmap_single(&bp->pdev->dev, dma_addr, bp->rx_buf_use_size,
 					 PCI_DMA_FROMDEVICE);
-			skb = build_skb(data, 0);
+			skb = build_skb(data, bp->rx_frag_size);
 			if (!skb) {
-				kfree(data);
+				bnx2_frag_free(bp, data);
 				goto error;
 			}
 			skb_reserve(skb, ((u8 *)get_l2_fhdr(data) - data) + BNX2_RX_OFFSET);
@@ -5358,6 +5369,10 @@ bnx2_set_rx_ring_size(struct bnx2 *bp, u32 size)
 	/* hw alignment + build_skb() overhead*/
 	bp->rx_buf_size = SKB_DATA_ALIGN(bp->rx_buf_use_size + BNX2_RX_ALIGN) +
 		NET_SKB_PAD + SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	if (bp->rx_buf_size <= PAGE_SIZE)
+		bp->rx_frag_size = bp->rx_buf_size;
+	else
+		bp->rx_frag_size = 0;
 	bp->rx_jumbo_thresh = rx_size - BNX2_RX_OFFSET;
 	bp->rx_ring_size = size;
 	bp->rx_max_ring = bnx2_find_max_ring(size, BNX2_MAX_RX_RINGS);
@@ -5436,7 +5451,7 @@ bnx2_free_rx_skbs(struct bnx2 *bp)
 
 		rx_buf->data = NULL;
 
-		kfree(data);
+		bnx2_frag_free(bp, data);
 	}
 	for (j = 0; j < bp->rx_max_pg_ring_idx; j++)
 		bnx2_free_rx_page(bp, rxr, j);
diff --git a/drivers/net/ethernet/broadcom/bnx2.h b/drivers/net/ethernet/broadcom/bnx2.h
index 172efbe..11f5dee 100644
--- a/drivers/net/ethernet/broadcom/bnx2.h
+++ b/drivers/net/ethernet/broadcom/bnx2.h
@@ -6804,6 +6804,7 @@ struct bnx2 {
 
 	u32			rx_buf_use_size;	/* useable size */
 	u32			rx_buf_size;		/* with alignment */
+	u32			rx_frag_size;		/* 0 if kmalloced(), or rx_buf_size */
 	u32			rx_copy_thresh;
 	u32			rx_jumbo_thresh;
 	u32			rx_max_ring_idx;

^ permalink raw reply related	[flat|nested] 16+ messages in thread
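Whether GRO and TCP coalescing are actually kicking in can be checked from
the TcpExt counters that also appear in the netstat attachments above — a
sketch (counter names as printed by netstat -s on kernels of this era):

  # ethtool -K eth0 gro on
  # netstat -s | grep -i coalesce
      TCPRcvCoalesce: 123

A growing TCPRcvCoalesce count means received segments are being merged
before they hit the socket.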
* Re: Slow speed of tcp connections in a network namespace
  2012-12-29 18:58       ` Eric Dumazet
@ 2012-12-29 19:41         ` Eric Dumazet
  2012-12-29 20:08           ` Andrew Vagin
  1 sibling, 1 reply; 16+ messages in thread

From: Eric Dumazet @ 2012-12-29 19:41 UTC (permalink / raw)
  To: Andrew Vagin; +Cc: netdev, vvs, Michał Mirosław

On Sat, 2012-12-29 at 19:58 +0100, Eric Dumazet wrote:
> On Sat, 2012-12-29 at 09:40 -0800, Eric Dumazet wrote:
> >
> > Please post your new tcpdump then ;)
> >
> > Also post "netstat -s" from the root and test ns after your wgets.
>
> Also try the following bnx2 patch.
>
> It should help GRO / TCP coalescing.
>
> bnx2 should be the last driver not using skb head_frag.

And of course, you should make sure all your bnx2 interrupts are handled
by the same cpu.

Or else, packets might be reordered because of the way dev_forward_skb()
works:

(CPU X gets a bunch of packets from eth0, forwards them via netif_rx()
into the local CPU X queue, NAPI ends on eth0.)

CPU Y gets a bunch of packets from eth0, forwards them via netif_rx()
into the local CPU Y queue.

CPU X and Y process their local queues in parallel -> packets are
delivered out of order to the TCP stack.

An alternative is to set up RPS on your veth1 device, to force packets
to be delivered/handled by a given cpu.

^ permalink raw reply	[flat|nested] 16+ messages in thread
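RPS is configured through sysfs, one CPU bitmask per receive queue — a
minimal sketch, assuming a single-queue veth1 and pinning its receive
processing to CPU 0 (mask 1); run it where veth1 lives, i.e. inside the
test namespace:

  # ip netns exec test sh -c 'echo 1 > /sys/class/net/veth1/queues/rx-0/rps_cpus'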
* Re: Slow speed of tcp connections in a network namespace
  2012-12-29 19:41         ` Eric Dumazet
@ 2012-12-29 20:08           ` Andrew Vagin
  2012-12-29 20:20             ` Eric Dumazet
  2012-12-29 21:12             ` Eric Dumazet
  0 siblings, 2 replies; 16+ messages in thread

From: Andrew Vagin @ 2012-12-29 20:08 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, vvs, Michał Mirosław

On Sat, Dec 29, 2012 at 11:41:02AM -0800, Eric Dumazet wrote:
> On Sat, 2012-12-29 at 19:58 +0100, Eric Dumazet wrote:
> > On Sat, 2012-12-29 at 09:40 -0800, Eric Dumazet wrote:
> > >
> > > Please post your new tcpdump then ;)
> > >
> > > Also post "netstat -s" from the root and test ns after your wgets.
> >
> > Also try the following bnx2 patch.
> >
> > It should help GRO / TCP coalescing.
> >
> > bnx2 should be the last driver not using skb head_frag.

I don't have access to the host. I'm going to test your patch tomorrow.
Thanks.

>
> And of course, you should make sure all your bnx2 interrupts are handled
> by the same cpu.

All bnx2 interrupts are handled on all cpus. They are handled on the same
cpu if the kernel is booted with bnx2.disable_msi=1.

Is it right that the receive window stays smaller when packets are
reordered? It looks like a bug.

What I mean is that it probably works correctly when packets are in order.
But I think that even when packets are reordered it should work at the
same speed; CPU load and memory consumption may be a bit higher.

>
> Or else, packets might be reordered because of the way dev_forward_skb()
> works:
>
> (CPU X gets a bunch of packets from eth0, forwards them via netif_rx()
> into the local CPU X queue, NAPI ends on eth0.)
>
> CPU Y gets a bunch of packets from eth0, forwards them via netif_rx()
> into the local CPU Y queue.
>
> CPU X and Y process their local queues in parallel -> packets are
> delivered out of order to the TCP stack.
>
> An alternative is to set up RPS on your veth1 device, to force packets
> to be delivered/handled by a given cpu.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: Slow speed of tcp connections in a network namespace
  2012-12-29 20:08           ` Andrew Vagin
@ 2012-12-29 20:20             ` Eric Dumazet
  2012-12-29 21:07               ` Andrew Vagin
  0 siblings, 1 reply; 16+ messages in thread

From: Eric Dumazet @ 2012-12-29 20:20 UTC (permalink / raw)
  To: Andrew Vagin; +Cc: netdev, vvs, Michał Mirosław

On Sun, 2012-12-30 at 00:08 +0400, Andrew Vagin wrote:
> On Sat, Dec 29, 2012 at 11:41:02AM -0800, Eric Dumazet wrote:
> > On Sat, 2012-12-29 at 19:58 +0100, Eric Dumazet wrote:
> > > On Sat, 2012-12-29 at 09:40 -0800, Eric Dumazet wrote:
> > > >
> > > > Please post your new tcpdump then ;)
> > > >
> > > > Also post "netstat -s" from the root and test ns after your wgets.
> > >
> > > Also try the following bnx2 patch.
> > >
> > > It should help GRO / TCP coalescing.
> > >
> > > bnx2 should be the last driver not using skb head_frag.
>
> I don't have access to the host. I'm going to test your patch tomorrow.
> Thanks.
>
> >
> > And of course, you should make sure all your bnx2 interrupts are handled
> > by the same cpu.
> All bnx2 interrupts are handled on all cpus. They are handled on the same
> cpu if the kernel is booted with bnx2.disable_msi=1.
>
> Is it right that the receive window stays smaller when packets are
> reordered? It looks like a bug.
>
> What I mean is that it probably works correctly when packets are in order.
> But I think that even when packets are reordered it should work at the
> same speed; CPU load and memory consumption may be a bit higher.

Without veth, it doesn't really matter that IRQs are spread over multiple
cpus, because packets are handled in NAPI, and only one cpu runs the
eth0 NAPI handler at a time.

But as soon as packets are queued (by netif_rx()) for 'later'
processing, you can see a dramatic performance decrease.

That's why you really should make sure IRQs for your eth0 device
are handled by a single cpu.

It will help to get better performance in most cases.

echo 1 >/proc/irq/*/eth0/../smp_affinity

If it doesn't work, you might try instead:

echo 1 >/proc/irq/default_smp_affinity
<you might need to reload the bnx2 module, or ifdown/ifup eth0>

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: Slow speed of tcp connections in a network namespace
  2012-12-29 20:20             ` Eric Dumazet
@ 2012-12-29 21:07               ` Andrew Vagin
  0 siblings, 0 replies; 16+ messages in thread

From: Andrew Vagin @ 2012-12-29 21:07 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, vvs, Michał Mirosław

On Sat, Dec 29, 2012 at 12:20:07PM -0800, Eric Dumazet wrote:
> On Sun, 2012-12-30 at 00:08 +0400, Andrew Vagin wrote:
> > All bnx2 interrupts are handled on all cpus. They are handled on the same
> > cpu if the kernel is booted with bnx2.disable_msi=1.
> >
> > Is it right that the receive window stays smaller when packets are
> > reordered? It looks like a bug.
> >
> > What I mean is that it probably works correctly when packets are in order.
> > But I think that even when packets are reordered it should work at the
> > same speed; CPU load and memory consumption may be a bit higher.
>
> Without veth, it doesn't really matter that IRQs are spread over multiple
> cpus, because packets are handled in NAPI, and only one cpu runs the
> eth0 NAPI handler at a time.
>
> But as soon as packets are queued (by netif_rx()) for 'later'
> processing, you can see a dramatic performance decrease.
>
> That's why you really should make sure IRQs for your eth0 device
> are handled by a single cpu.
>
> It will help to get better performance in most cases.

I understand this fact, but such a big difference looks strange to me.

Default configuration (with the bug):
# cat /proc/interrupts | grep eth0
 68:  10187  10188  10187  10023  10190  10185  10187  10019  PCI-MSI-edge  eth0

>
> echo 1 >/proc/irq/*/eth0/../smp_affinity

This doesn't help. I also tried echo 0 > /proc/irq/68/smp_affinity_list.
That doesn't help either.

>
> If it doesn't work, you might try instead:
>
> echo 1 >/proc/irq/default_smp_affinity
> <you might need to reload the bnx2 module, or ifdown/ifup eth0>

This helps, and the bug is not reproduced in this case.
# cat /proc/interrupts | grep eth0
 68:  60777      0      0      0      0      0      0      0  PCI-MSI-edge  eth0

Thanks.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: Slow speed of tcp connections in a network namespace
  2012-12-29 20:08           ` Andrew Vagin
  2012-12-29 20:20             ` Eric Dumazet
@ 2012-12-29 21:12             ` Eric Dumazet
  2012-12-29 21:19               ` Andrew Vagin
  1 sibling, 1 reply; 16+ messages in thread

From: Eric Dumazet @ 2012-12-29 21:12 UTC (permalink / raw)
  To: Andrew Vagin; +Cc: netdev, vvs, Michał Mirosław

On Sun, 2012-12-30 at 00:08 +0400, Andrew Vagin wrote:
> Is it right that the receive window stays smaller when packets are
> reordered? It looks like a bug.

Not really a bug.

TCP is very sensitive to packet reordering. I won't elaborate here as
it's a bit off topic.

Try reordering the credits/debits on your bank account; I am pretty sure
you'll lose some money or even get into serious trouble.

Of course, enabling GRO on eth0 would definitely help a bit...

(once/iff veth driver features are fixed to allow GSO packets to be
forwarded without being segmented again)

^ permalink raw reply	[flat|nested] 16+ messages in thread
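The reordering itself is visible in the counters from the netstat
attachments earlier in the thread — a sketch; TCPOFOQueue counts segments
that arrived out of order and had to be queued:

  # netstat -s | grep -iE 'reorder|TCPOFOQueue'
      TCPOFOQueue: 1580

Note that the netns run above shows a higher TCPOFOQueue (1580) than the
host run (1187) despite transferring far less data.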
* Re: Slow speed of tcp connections in a network namespace
  2012-12-29 21:12             ` Eric Dumazet
@ 2012-12-29 21:19               ` Andrew Vagin
  0 siblings, 0 replies; 16+ messages in thread

From: Andrew Vagin @ 2012-12-29 21:19 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, vvs, Michał Mirosław

On Sat, Dec 29, 2012 at 01:12:26PM -0800, Eric Dumazet wrote:
> On Sun, 2012-12-30 at 00:08 +0400, Andrew Vagin wrote:
> > Is it right that the receive window stays smaller when packets are
> > reordered? It looks like a bug.
>
> Not really a bug.
>
> TCP is very sensitive to packet reordering. I won't elaborate here as
> it's a bit off topic.
>
> Try reordering the credits/debits on your bank account; I am pretty sure
> you'll lose some money or even get into serious trouble.
>
> Of course, enabling GRO on eth0 would definitely help a bit...
>
> (once/iff veth driver features are fixed to allow GSO packets to be
> forwarded without being segmented again)

Eric, thank you for the help. I need some time to think it over. I will
ask if new questions come up.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: Slow speed of tcp connections in a network namespace
  2012-12-29 18:58       ` Eric Dumazet
  2012-12-29 19:41         ` Eric Dumazet
@ 2012-12-29 21:15         ` Andrew Vagin
  1 sibling, 0 replies; 16+ messages in thread

From: Andrew Vagin @ 2012-12-29 21:15 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, vvs, Michał Mirosław

On Sat, Dec 29, 2012 at 07:58:36PM +0100, Eric Dumazet wrote:
> On Sat, 2012-12-29 at 09:40 -0800, Eric Dumazet wrote:
> >
> > Please post your new tcpdump then ;)
> >
> > Also post "netstat -s" from the root and test ns after your wgets.
>
> Also try the following bnx2 patch.
>
> It should help GRO / TCP coalescing.
>
> bnx2 should be the last driver not using skb head_frag.

This patch breaks nothing, but I don't know what kind of benefit I should
get from it :).

FYI: I forgot to say that I disabled gro before collecting the tcpdumps,
because that way the dumps from veth and from eth0 are easier to compare.

> diff --git a/drivers/net/ethernet/broadcom/bnx2.c b/drivers/net/ethernet/broadcom/bnx2.c
> index a1adfaf..08a2d40 100644
> --- a/drivers/net/ethernet/broadcom/bnx2.c
> +++ b/drivers/net/ethernet/broadcom/bnx2.c
> @@ -2726,6 +2726,14 @@ bnx2_free_rx_page(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index)
>  	rx_pg->page = NULL;
>  }
>  
> +static void bnx2_frag_free(const struct bnx2 *bp, void *data)
> +{
> +	if (bp->rx_frag_size)
> +		put_page(virt_to_head_page(data));
> +	else
> +		kfree(data);
> +}
> +
>  static inline int
>  bnx2_alloc_rx_data(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index, gfp_t gfp)
>  {
> @@ -2735,7 +2743,10 @@ bnx2_alloc_rx_data(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index, gfp_t gfp)
>  	struct bnx2_rx_bd *rxbd =
>  		&rxr->rx_desc_ring[BNX2_RX_RING(index)][BNX2_RX_IDX(index)];
>  
> -	data = kmalloc(bp->rx_buf_size, gfp);
> +	if (bp->rx_frag_size)
> +		data = netdev_alloc_frag(bp->rx_frag_size);
> +	else
> +		data = kmalloc(bp->rx_buf_size, gfp);
>  	if (!data)
>  		return -ENOMEM;
>  
> @@ -2744,7 +2755,7 @@ bnx2_alloc_rx_data(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index, gfp_t gfp)
>  			bp->rx_buf_use_size,
>  			PCI_DMA_FROMDEVICE);
>  	if (dma_mapping_error(&bp->pdev->dev, mapping)) {
> -		kfree(data);
> +		bnx2_frag_free(bp, data);
>  		return -EIO;
>  	}
>  
> @@ -3014,9 +3025,9 @@ error:
>  
>  			dma_unmap_single(&bp->pdev->dev, dma_addr, bp->rx_buf_use_size,
>  					 PCI_DMA_FROMDEVICE);
> -			skb = build_skb(data, 0);
> +			skb = build_skb(data, bp->rx_frag_size);
>  			if (!skb) {
> -				kfree(data);
> +				bnx2_frag_free(bp, data);
>  				goto error;
>  			}
>  			skb_reserve(skb, ((u8 *)get_l2_fhdr(data) - data) + BNX2_RX_OFFSET);
> @@ -5358,6 +5369,10 @@ bnx2_set_rx_ring_size(struct bnx2 *bp, u32 size)
>  	/* hw alignment + build_skb() overhead*/
>  	bp->rx_buf_size = SKB_DATA_ALIGN(bp->rx_buf_use_size + BNX2_RX_ALIGN) +
>  		NET_SKB_PAD + SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> +	if (bp->rx_buf_size <= PAGE_SIZE)
> +		bp->rx_frag_size = bp->rx_buf_size;
> +	else
> +		bp->rx_frag_size = 0;
>  	bp->rx_jumbo_thresh = rx_size - BNX2_RX_OFFSET;
>  	bp->rx_ring_size = size;
>  	bp->rx_max_ring = bnx2_find_max_ring(size, BNX2_MAX_RX_RINGS);
> @@ -5436,7 +5451,7 @@ bnx2_free_rx_skbs(struct bnx2 *bp)
>  
>  		rx_buf->data = NULL;
>  
> -		kfree(data);
> +		bnx2_frag_free(bp, data);
>  	}
>  	for (j = 0; j < bp->rx_max_pg_ring_idx; j++)
>  		bnx2_free_rx_page(bp, rxr, j);
> diff --git a/drivers/net/ethernet/broadcom/bnx2.h b/drivers/net/ethernet/broadcom/bnx2.h
> index 172efbe..11f5dee 100644
> --- a/drivers/net/ethernet/broadcom/bnx2.h
> +++ b/drivers/net/ethernet/broadcom/bnx2.h
> @@ -6804,6 +6804,7 @@ struct bnx2 {
>  
>  	u32			rx_buf_use_size;	/* useable size */
>  	u32			rx_buf_size;		/* with alignment */
> +	u32			rx_frag_size;		/* 0 if kmalloced(), or rx_buf_size */
>  	u32			rx_copy_thresh;
>  	u32			rx_jumbo_thresh;
>  	u32			rx_max_ring_idx;

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: Slow speed of tcp connections in a network namespace
  2012-12-29 13:53 ` Eric Dumazet
  2012-12-29 14:50   ` Andrew Vagin
@ 2012-12-29 16:01   ` Michał Mirosław
  2012-12-30  2:26     ` [PATCH] veth: extend device features Eric Dumazet
  1 sibling, 1 reply; 16+ messages in thread

From: Michał Mirosław @ 2012-12-29 16:01 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Andrew Vagin, netdev, vvs, Michał Mirosław

2012/12/29 Eric Dumazet <eric.dumazet@gmail.com>:
> On Sat, 2012-12-29 at 13:24 +0400, Andrew Vagin wrote:
>> We found a few nodes where the network is slow in containers.
>>
>> To test the speed of TCP connections we use wget, which downloads ISO
>> images from the internet.
>>
>> wget in the new netns reports only 1.5 MB/s, but wget in the root netns
>> reports 33 MB/s.
>>
>> A few facts:
>> * Experiments show that the window size for CT traffic does not increase
>>   beyond ~900, while for host traffic the window size increases up to ~14000
>> * packets are sometimes reordered in the netns.
>> * changing tso/gro/gso on the interfaces does not help
>> * the issue was _NOT_ reproduced if the kernel was booted with maxcpus=1
>>   or bnx2.disable_msi=1
>>
>> I reduced the steps to reproduce:
>> * Create a new network namespace "test" and a veth pair.
>> # ip netns add test
>> # ip link add name veth0 type veth peer name veth1
>>
>> * Move veth1 into the netns test
>> # ip link set veth1 netns test
>>
>> * Set an IP address on veth1 and add proper routing rules for this IP
>>   in the root netns.
>> # ip link set up dev veth0
>> # ip netns exec test ip link set up dev veth1
>> # ip netns exec test ip a add REMOTE dev veth1
>> # ip netns exec test ip r a default dev veth1
>> # ip r a REMOTE/32 dev veth0
>>
>> Tcpdumps for both cases are attached to this message.
>> tcpdump.host - wget in the root netns
>> tcpdump.netns.host - tcpdump for the host device, wget in the new netns
>> tcpdump.netns.veth - tcpdump for the veth1 device, wget in the new netns
>>
>> 3.8-rc1 is used for the experiments.
>>
>> Do you have any ideas where the problem is?
>
> veth has absolutely no offload features.
>
> It needs some care...
>
> At the very minimum, let TCP coalescing do its job by allowing SG.
>
> CC Michał Mirosław <mirq-linux@rere.qmqm.pl> for insights.

veth is just like a tunnel device. In terms of offloads, it can do
anything we have software fallbacks for (in case packets get forwarded
to real hardware).

> Please try the following patch:
>
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 95814d9..9fefeb3 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -259,6 +259,10 @@ static const struct net_device_ops veth_netdev_ops = {
>  	.ndo_set_mac_address = eth_mac_addr,
>  };
>
> +#define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_TSO | \
> +		       NETIF_F_HW_CSUM | NETIF_F_HIGHDMA | \
> +		       NETIF_F_HW_VLAN_TX | NETIF_F_HW_VLAN_RX)
> +
>  static void veth_setup(struct net_device *dev)
>  {
>  	ether_setup(dev);
> @@ -269,9 +273,10 @@ static void veth_setup(struct net_device *dev)
>  	dev->netdev_ops = &veth_netdev_ops;
>  	dev->ethtool_ops = &veth_ethtool_ops;
>  	dev->features |= NETIF_F_LLTX;
> +	dev->features |= VETH_FEATURES;
>  	dev->destructor = veth_dev_free;
>
> -	dev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG | NETIF_F_RXCSUM;
> +	dev->hw_features = VETH_FEATURES;
>  }

You missed NETIF_F_RXCSUM in VETH_FEATURES. We might also support
NETIF_F_ALL_TSO, not just the IPv4 version.

Best Regards,
Michał Mirosław

^ permalink raw reply	[flat|nested] 16+ messages in thread
* [PATCH] veth: extend device features
  2012-12-29 16:01   ` Michał Mirosław
@ 2012-12-30  2:26     ` Eric Dumazet
  2012-12-30 10:32       ` David Miller
  0 siblings, 1 reply; 16+ messages in thread

From: Eric Dumazet @ 2012-12-30  2:26 UTC (permalink / raw)
  To: Michał Mirosław, David Miller
  Cc: Andrew Vagin, netdev, vvs, Michał Mirosław

From: Eric Dumazet <edumazet@google.com>

veth is lacking most modern facilities, like SG, checksums, TSO.

It makes sense to extend dev->features to get them; otherwise GRO
aggregation is defeated by a forced segmentation.

Reported-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Michał Mirosław <mirq-linux@rere.qmqm.pl>
---
 drivers/net/veth.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 95814d9..ccf211f 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -259,6 +259,10 @@ static const struct net_device_ops veth_netdev_ops = {
 	.ndo_set_mac_address = eth_mac_addr,
 };
 
+#define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_ALL_TSO | \
+		       NETIF_F_HW_CSUM | NETIF_F_RXCSUM | NETIF_F_HIGHDMA | \
+		       NETIF_F_HW_VLAN_TX | NETIF_F_HW_VLAN_RX)
+
 static void veth_setup(struct net_device *dev)
 {
 	ether_setup(dev);
@@ -269,9 +273,10 @@ static void veth_setup(struct net_device *dev)
 	dev->netdev_ops = &veth_netdev_ops;
 	dev->ethtool_ops = &veth_ethtool_ops;
 	dev->features |= NETIF_F_LLTX;
+	dev->features |= VETH_FEATURES;
 	dev->destructor = veth_dev_free;
 
-	dev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG | NETIF_F_RXCSUM;
+	dev->hw_features = VETH_FEATURES;
 }
 
 /*

^ permalink raw reply related	[flat|nested] 16+ messages in thread
* Re: [PATCH] veth: extend device features
  2012-12-30  2:26     ` [PATCH] veth: extend device features Eric Dumazet
@ 2012-12-30 10:32       ` David Miller
  0 siblings, 0 replies; 16+ messages in thread

From: David Miller @ 2012-12-30 10:32 UTC (permalink / raw)
  To: erdnetdev; +Cc: mirqus, avagin, netdev, vvs, mirq-linux

From: Eric Dumazet <erdnetdev@gmail.com>
Date: Sat, 29 Dec 2012 18:26:10 -0800

> From: Eric Dumazet <edumazet@google.com>
>
> veth is lacking most modern facilities, like SG, checksums, TSO.
>
> It makes sense to extend dev->features to get them; otherwise GRO
> aggregation is defeated by a forced segmentation.
>
> Reported-by: Andrew Vagin <avagin@parallels.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Michał Mirosław <mirq-linux@rere.qmqm.pl>

Applied.

^ permalink raw reply	[flat|nested] 16+ messages in thread