* Slow speed of tcp connections in a network namespace
@ 2012-12-29  9:24 Andrew Vagin
  2012-12-29 13:53 ` Eric Dumazet
  0 siblings, 1 reply; 16+ messages in thread

From: Andrew Vagin @ 2012-12-29  9:24 UTC (permalink / raw)
  To: netdev; +Cc: vvs

[-- Attachment #1: Type: text/plain, Size: 1419 bytes --]

We found a few nodes where the network is slow in containers.

To test the speed of TCP connections we use wget, which downloads ISO
images from the internet.

wget in the new netns reports only 1.5 MB/s, but wget in the root netns
reports 33 MB/s.

A few facts:
* Experiments show that the window size for CT traffic does not increase
  beyond ~900, while for host traffic the window size increases up to ~14000
* packets are sometimes reordered in the netns.
* changing tso/gro/gso on the interfaces does not help
* the issue was _NOT_ reproduced if the kernel was booted with maxcpus=1
  or bnx2.disable_msi=1

I reduced the steps to reproduce:
* Create a new network namespace "test" and a veth pair.
# ip netns add test
# ip link add name veth0 type veth peer name veth1

* Move veth1 into the netns test
# ip link set veth1 netns test

* Set an IP address on veth1 and add proper routing rules for this IP
  in the root netns.
# ip link set up dev veth0
# ip netns exec test ip link set up dev veth1
# ip netns exec test ip a add REMOTE dev veth1
# ip netns exec test ip r a default dev veth1
# ip r a REMOTE/32 dev veth0

Tcpdumps for both cases are attached to this message.
tcpdump.host - wget in the root netns
tcpdump.netns.host - tcpdump for the host device, wget in the new netns
tcpdump.netns.veth - tcpdump for the veth1 device, wget in the new netns

3.8-rc1 is used for the experiments.

Do you have any ideas where the problem is?

[-- Attachment #2: tcpdump.host.gz --]
[-- Type: application/x-gzip, Size: 168126 bytes --]

[-- Attachment #3: tcpdump.netns.veth.gz --]
[-- Type: application/x-gzip, Size: 178809 bytes --]

[-- Attachment #4: tcpdump.netns.host.gz --]
[-- Type: application/x-gzip, Size: 178424 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread
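For reference, the measurement itself is just two wget runs, one per
namespace — a sketch of the test method described above (the mirror URL is
hypothetical; any large file works):

  # wget -O /dev/null http://mirror.example.com/test.iso
  # ip netns exec test wget -O /dev/null http://mirror.example.com/test.iso

wget prints the average throughput at the end of each run, which is where
the 33 MB/s vs 1.5 MB/s numbers come from.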
* Re: Slow speed of tcp connections in a network namespace
  2012-12-29  9:24 Slow speed of tcp connections in a network namespace Andrew Vagin
@ 2012-12-29 13:53 ` Eric Dumazet
  2012-12-29 14:50   ` Andrew Vagin
  2012-12-29 16:01   ` Michał Mirosław
  0 siblings, 2 replies; 16+ messages in thread

From: Eric Dumazet @ 2012-12-29 13:53 UTC (permalink / raw)
  To: Andrew Vagin; +Cc: netdev, vvs, Michał Mirosław

On Sat, 2012-12-29 at 13:24 +0400, Andrew Vagin wrote:
> We found a few nodes where the network is slow in containers.
>
> To test the speed of TCP connections we use wget, which downloads ISO
> images from the internet.
>
> wget in the new netns reports only 1.5 MB/s, but wget in the root netns
> reports 33 MB/s.
>
> A few facts:
> * Experiments show that the window size for CT traffic does not increase
>   beyond ~900, while for host traffic the window size increases up to ~14000
> * packets are sometimes reordered in the netns.
> * changing tso/gro/gso on the interfaces does not help
> * the issue was _NOT_ reproduced if the kernel was booted with maxcpus=1
>   or bnx2.disable_msi=1
>
> I reduced the steps to reproduce:
> * Create a new network namespace "test" and a veth pair.
> # ip netns add test
> # ip link add name veth0 type veth peer name veth1
>
> * Move veth1 into the netns test
> # ip link set veth1 netns test
>
> * Set an IP address on veth1 and add proper routing rules for this IP
>   in the root netns.
> # ip link set up dev veth0
> # ip netns exec test ip link set up dev veth1
> # ip netns exec test ip a add REMOTE dev veth1
> # ip netns exec test ip r a default dev veth1
> # ip r a REMOTE/32 dev veth0
>
> Tcpdumps for both cases are attached to this message.
> tcpdump.host - wget in the root netns
> tcpdump.netns.host - tcpdump for the host device, wget in the new netns
> tcpdump.netns.veth - tcpdump for the veth1 device, wget in the new netns
>
> 3.8-rc1 is used for the experiments.
>
> Do you have any ideas where the problem is?

veth has absolutely no offload features.

It needs some care...

At the very minimum, let TCP coalescing do its job by allowing SG.

CC Michał Mirosław <mirq-linux@rere.qmqm.pl> for insights.

Please try the following patch:

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 95814d9..9fefeb3 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -259,6 +259,10 @@ static const struct net_device_ops veth_netdev_ops = {
 	.ndo_set_mac_address = eth_mac_addr,
 };
 
+#define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_TSO | \
+		       NETIF_F_HW_CSUM | NETIF_F_HIGHDMA | \
+		       NETIF_F_HW_VLAN_TX | NETIF_F_HW_VLAN_RX)
+
 static void veth_setup(struct net_device *dev)
 {
 	ether_setup(dev);
@@ -269,9 +273,10 @@ static void veth_setup(struct net_device *dev)
 	dev->netdev_ops = &veth_netdev_ops;
 	dev->ethtool_ops = &veth_ethtool_ops;
 	dev->features |= NETIF_F_LLTX;
+	dev->features |= VETH_FEATURES;
 	dev->destructor = veth_dev_free;
 
-	dev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG | NETIF_F_RXCSUM;
+	dev->hw_features = VETH_FEATURES;
 }
 
 /*

^ permalink raw reply related	[flat|nested] 16+ messages in thread
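A quick way to verify which of the features from a patch like the one
above actually took effect is ethtool — a sketch, assuming a standard
ethtool build (the exact output line names can vary by version):

  # ethtool -k veth0 | grep -E 'scatter-gather|tcp-segmentation-offload|tx-checksumming'
  tx-checksumming: on
  scatter-gather: on
  tcp-segmentation-offload: on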
* Re: Slow speed of tcp connections in a network namespace
  2012-12-29 13:53 ` Eric Dumazet
@ 2012-12-29 14:50   ` Andrew Vagin
  2012-12-29 17:40     ` Eric Dumazet
  2012-12-29 16:01   ` Michał Mirosław
  1 sibling, 1 reply; 16+ messages in thread

From: Andrew Vagin @ 2012-12-29 14:50 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, vvs, Michał Mirosław

On Sat, Dec 29, 2012 at 05:53:23AM -0800, Eric Dumazet wrote:
> > 3.8-rc1 is used for the experiments.
> >
> > Do you have any ideas where the problem is?
>
> veth has absolutely no offload features.
>
> It needs some care...
>
> At the very minimum, let TCP coalescing do its job by allowing SG.
>
> CC Michał Mirosław <mirq-linux@rere.qmqm.pl> for insights.
>
> Please try the following patch:

Hello Eric,

Thanks for your feedback.

With this patch the result is a bit better (~4 MB/s), but it's still much
less than in the root netns.

> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 95814d9..9fefeb3 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -259,6 +259,10 @@ static const struct net_device_ops veth_netdev_ops = {
>  	.ndo_set_mac_address = eth_mac_addr,
>  };
>  
> +#define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_TSO | \
> +		       NETIF_F_HW_CSUM | NETIF_F_HIGHDMA | \
> +		       NETIF_F_HW_VLAN_TX | NETIF_F_HW_VLAN_RX)
> +
>  static void veth_setup(struct net_device *dev)
>  {
>  	ether_setup(dev);
> @@ -269,9 +273,10 @@ static void veth_setup(struct net_device *dev)
>  	dev->netdev_ops = &veth_netdev_ops;
>  	dev->ethtool_ops = &veth_ethtool_ops;
>  	dev->features |= NETIF_F_LLTX;
> +	dev->features |= VETH_FEATURES;
>  	dev->destructor = veth_dev_free;
>  
> -	dev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG | NETIF_F_RXCSUM;
> +	dev->hw_features = VETH_FEATURES;
>  }
>  
>  /*

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: Slow speed of tcp connections in a network namespace
  2012-12-29 14:50   ` Andrew Vagin
@ 2012-12-29 17:40     ` Eric Dumazet
  2012-12-29 18:29       ` Andrew Vagin
  2012-12-29 18:58       ` Eric Dumazet
  0 siblings, 2 replies; 16+ messages in thread

From: Eric Dumazet @ 2012-12-29 17:40 UTC (permalink / raw)
  To: Andrew Vagin; +Cc: netdev, vvs, Michał Mirosław

On Sat, 2012-12-29 at 18:50 +0400, Andrew Vagin wrote:
> On Sat, Dec 29, 2012 at 05:53:23AM -0800, Eric Dumazet wrote:
> > > 3.8-rc1 is used for the experiments.
> > >
> > > Do you have any ideas where the problem is?
> >
> > veth has absolutely no offload features.
> >
> > It needs some care...
> >
> > At the very minimum, let TCP coalescing do its job by allowing SG.
> >
> > CC Michał Mirosław <mirq-linux@rere.qmqm.pl> for insights.
> >
> > Please try the following patch:
>
> Hello Eric,
>
> Thanks for your feedback.
>
> With this patch the result is a bit better (~4 MB/s), but it's still much
> less than in the root netns.

Please post your new tcpdump then ;)

Also post "netstat -s" from the root and test ns after your wgets.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: Slow speed of tcp connections in a network namespace
  2012-12-29 17:40     ` Eric Dumazet
@ 2012-12-29 18:29       ` Andrew Vagin
  0 siblings, 0 replies; 16+ messages in thread

From: Andrew Vagin @ 2012-12-29 18:29 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, vvs, Michał Mirosław

[-- Attachment #1: Type: text/plain, Size: 1014 bytes --]

On Sat, Dec 29, 2012 at 09:40:28AM -0800, Eric Dumazet wrote:
> On Sat, 2012-12-29 at 18:50 +0400, Andrew Vagin wrote:
> > On Sat, Dec 29, 2012 at 05:53:23AM -0800, Eric Dumazet wrote:
> > > > 3.8-rc1 is used for the experiments.
> > > >
> > > > Do you have any ideas where the problem is?
> > >
> > > veth has absolutely no offload features.
> > >
> > > It needs some care...
> > >
> > > At the very minimum, let TCP coalescing do its job by allowing SG.
> > >
> > > CC Michał Mirosław <mirq-linux@rere.qmqm.pl> for insights.
> > >
> > > Please try the following patch:
> >
> > Hello Eric,
> >
> > Thanks for your feedback.
> >
> > With this patch the result is a bit better (~4 MB/s), but it's still much
> > less than in the root netns.
>
> Please post your new tcpdump then ;)

I have rebooted the host, and the speed in the netns is again about
1.7 MB/s. I don't know why it was 4 MB/s the previous time.

The new tcpdumps and netstat output are attached.

>
> Also post "netstat -s" from the root and test ns after your wgets.

[-- Attachment #2: tcpdump.host.gz --]
[-- Type: application/x-gzip, Size: 165716 bytes --]

[-- Attachment #3: tcpdump.netns.host.gz --]
[-- Type: application/x-gzip, Size: 180703 bytes --]

[-- Attachment #4: tcpdump.netns.veth.gz --]
[-- Type: application/x-gzip, Size: 181615 bytes --]

[-- Attachment #5: netstat.host --]
[-- Type: text/plain, Size: 1821 bytes --]

Ip:
    277536 total packets received
    20 forwarded
    0 incoming packets discarded
    202326 incoming packets delivered
    108228 requests sent out
    30 dropped because of missing route
Icmp:
    10 ICMP messages received
    0 input ICMP message failed.
    ICMP input histogram:
        destination unreachable: 5
        echo requests: 5
    6 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        destination unreachable: 1
        echo replies: 5
IcmpMsg:
        InType3: 5
        InType8: 5
        OutType0: 5
        OutType3: 1
Tcp:
    1491 active connections openings
    12 passive connection openings
    14 failed connection attempts
    72 connection resets received
    2 connections established
    201920 segments received
    107815 segments send out
    0 segments retransmited
    0 bad segments received.
    1338 resets sent
Udp:
    387 packets received
    0 packets to unknown port received.
    0 packet receive errors
    389 packets sent
UdpLite:
TcpExt:
    3 invalid SYN cookies received
    4 TCP sockets finished time wait in fast timer
    63 delayed acks sent
    9 delayed acks further delayed because of locked socket
    236 packets directly queued to recvmsg prequeue.
    38600456 packets directly received from backlog
    298101 packets directly received from prequeue
    163501 packets header predicted
    27103 packets header predicted and directly queued to user
    2578 acknowledgments not containing data received
    1018 predicted acknowledgments
    15 connections reset due to unexpected data
    72 connections reset due to early user close
    TCPRcvCoalesce: 123
    TCPOFOQueue: 1187
IpExt:
    InBcastPkts: 9
    OutBcastPkts: 1
    InOctets: 296395504
    OutOctets: 6965311
    InBcastOctets: 2789
    OutBcastOctets: 165

[-- Attachment #6: netstat.netns --]
[-- Type: text/plain, Size: 1463 bytes --]

Ip:
    25483 total packets received
    0 forwarded
    0 incoming packets discarded
    25483 incoming packets delivered
    14572 requests sent out
Icmp:
    4 ICMP messages received
    0 input ICMP message failed.
    ICMP input histogram:
        echo requests: 2
        echo replies: 2
    4 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        echo request: 2
        echo replies: 2
IcmpMsg:
        InType0: 2
        InType8: 2
        OutType0: 2
        OutType8: 2
Tcp:
    1 active connections openings
    0 passive connection openings
    0 failed connection attempts
    0 connection resets received
    0 connections established
    25473 segments received
    14562 segments send out
    0 segments retransmited
    0 bad segments received.
    77 resets sent
Udp:
    6 packets received
    0 packets to unknown port received.
    0 packet receive errors
    6 packets sent
UdpLite:
TcpExt:
    38 delayed acks sent
    Quick ack mode was activated 2 times
    52 packets directly queued to recvmsg prequeue.
    4916752 packets directly received from backlog
    11584 packets directly received from prequeue
    12538 packets header predicted
    2649 packets header predicted and directly queued to user
    1 acknowledgments not containing data received
    2 DSACKs sent for old packets
    1 connections reset due to unexpected data
    TCPOFOQueue: 1580
IpExt:
    InOctets: 38201843
    OutOctets: 829966

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: Slow speed of tcp connections in a network namespace
  2012-12-29 17:40     ` Eric Dumazet
  2012-12-29 18:29       ` Andrew Vagin
@ 2012-12-29 18:58       ` Eric Dumazet
  2012-12-29 19:41         ` Eric Dumazet
  2012-12-29 21:15         ` Andrew Vagin
  1 sibling, 2 replies; 16+ messages in thread

From: Eric Dumazet @ 2012-12-29 18:58 UTC (permalink / raw)
  To: Andrew Vagin; +Cc: netdev, vvs, Michał Mirosław

On Sat, 2012-12-29 at 09:40 -0800, Eric Dumazet wrote:
>
> Please post your new tcpdump then ;)
>
> Also post "netstat -s" from the root and test ns after your wgets.

Also try the following bnx2 patch.

It should help GRO / TCP coalescing.

bnx2 should be the last driver not using skb head_frag.

diff --git a/drivers/net/ethernet/broadcom/bnx2.c b/drivers/net/ethernet/broadcom/bnx2.c
index a1adfaf..08a2d40 100644
--- a/drivers/net/ethernet/broadcom/bnx2.c
+++ b/drivers/net/ethernet/broadcom/bnx2.c
@@ -2726,6 +2726,14 @@ bnx2_free_rx_page(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index)
 	rx_pg->page = NULL;
 }
 
+static void bnx2_frag_free(const struct bnx2 *bp, void *data)
+{
+	if (bp->rx_frag_size)
+		put_page(virt_to_head_page(data));
+	else
+		kfree(data);
+}
+
 static inline int
 bnx2_alloc_rx_data(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index, gfp_t gfp)
 {
@@ -2735,7 +2743,10 @@ bnx2_alloc_rx_data(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index, gfp_t gfp)
 	struct bnx2_rx_bd *rxbd =
 		&rxr->rx_desc_ring[BNX2_RX_RING(index)][BNX2_RX_IDX(index)];
 
-	data = kmalloc(bp->rx_buf_size, gfp);
+	if (bp->rx_frag_size)
+		data = netdev_alloc_frag(bp->rx_frag_size);
+	else
+		data = kmalloc(bp->rx_buf_size, gfp);
 	if (!data)
 		return -ENOMEM;
 
@@ -2744,7 +2755,7 @@ bnx2_alloc_rx_data(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index, gfp_t gfp)
 			bp->rx_buf_use_size,
 			PCI_DMA_FROMDEVICE);
 	if (dma_mapping_error(&bp->pdev->dev, mapping)) {
-		kfree(data);
+		bnx2_frag_free(bp, data);
 		return -EIO;
 	}
 
@@ -3014,9 +3025,9 @@ error:
 
 			dma_unmap_single(&bp->pdev->dev, dma_addr, bp->rx_buf_use_size,
 					 PCI_DMA_FROMDEVICE);
-			skb = build_skb(data, 0);
+			skb = build_skb(data, bp->rx_frag_size);
 			if (!skb) {
-				kfree(data);
+				bnx2_frag_free(bp, data);
 				goto error;
 			}
 			skb_reserve(skb, ((u8 *)get_l2_fhdr(data) - data) + BNX2_RX_OFFSET);
@@ -5358,6 +5369,10 @@ bnx2_set_rx_ring_size(struct bnx2 *bp, u32 size)
 	/* hw alignment + build_skb() overhead*/
 	bp->rx_buf_size = SKB_DATA_ALIGN(bp->rx_buf_use_size + BNX2_RX_ALIGN) +
 		NET_SKB_PAD + SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	if (bp->rx_buf_size <= PAGE_SIZE)
+		bp->rx_frag_size = bp->rx_buf_size;
+	else
+		bp->rx_frag_size = 0;
 	bp->rx_jumbo_thresh = rx_size - BNX2_RX_OFFSET;
 	bp->rx_ring_size = size;
 	bp->rx_max_ring = bnx2_find_max_ring(size, BNX2_MAX_RX_RINGS);
@@ -5436,7 +5451,7 @@ bnx2_free_rx_skbs(struct bnx2 *bp)
 
 		rx_buf->data = NULL;
 
-		kfree(data);
+		bnx2_frag_free(bp, data);
 	}
 	for (j = 0; j < bp->rx_max_pg_ring_idx; j++)
 		bnx2_free_rx_page(bp, rxr, j);
diff --git a/drivers/net/ethernet/broadcom/bnx2.h b/drivers/net/ethernet/broadcom/bnx2.h
index 172efbe..11f5dee 100644
--- a/drivers/net/ethernet/broadcom/bnx2.h
+++ b/drivers/net/ethernet/broadcom/bnx2.h
@@ -6804,6 +6804,7 @@ struct bnx2 {
 
 	u32			rx_buf_use_size;	/* useable size */
 	u32			rx_buf_size;		/* with alignment */
+	u32			rx_frag_size;		/* 0 if kmalloced(), or rx_buf_size */
 	u32			rx_copy_thresh;
 	u32			rx_jumbo_thresh;
 	u32			rx_max_ring_idx;

^ permalink raw reply related	[flat|nested] 16+ messages in thread
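Whether GRO and TCP coalescing are actually kicking in can be checked from
the TcpExt counters that also appear in the netstat attachments above — a
sketch (counter names as printed by netstat -s on kernels of this era):

  # ethtool -K eth0 gro on
  # netstat -s | grep -i coalesce
      TCPRcvCoalesce: 123

A growing TCPRcvCoalesce count means received segments are being merged
before they hit the socket.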
* Re: Slow speed of tcp connections in a network namespace
  2012-12-29 18:58       ` Eric Dumazet
@ 2012-12-29 19:41         ` Eric Dumazet
  2012-12-29 20:08           ` Andrew Vagin
  1 sibling, 1 reply; 16+ messages in thread

From: Eric Dumazet @ 2012-12-29 19:41 UTC (permalink / raw)
  To: Andrew Vagin; +Cc: netdev, vvs, Michał Mirosław

On Sat, 2012-12-29 at 19:58 +0100, Eric Dumazet wrote:
> On Sat, 2012-12-29 at 09:40 -0800, Eric Dumazet wrote:
> >
> > Please post your new tcpdump then ;)
> >
> > Also post "netstat -s" from the root and test ns after your wgets.
>
> Also try the following bnx2 patch.
>
> It should help GRO / TCP coalescing.
>
> bnx2 should be the last driver not using skb head_frag.

And of course, you should make sure all your bnx2 interrupts are handled
by the same cpu.

Or else, packets might be reordered because of the way dev_forward_skb()
works:

(CPU X gets a bunch of packets from eth0, forwards them via netif_rx()
into the local CPU X queue, NAPI ends on eth0.)

CPU Y gets a bunch of packets from eth0, forwards them via netif_rx()
into the local CPU Y queue.

CPU X and Y process their local queues in parallel -> packets are
delivered out of order to the TCP stack.

An alternative is to set up RPS on your veth1 device, to force packets
to be delivered/handled by a given cpu.

^ permalink raw reply	[flat|nested] 16+ messages in thread
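RPS is configured through sysfs, one CPU bitmask per receive queue — a
minimal sketch, assuming a single-queue veth1 and pinning its receive
processing to CPU 0 (mask 1); run it where veth1 lives, i.e. inside the
test namespace:

  # ip netns exec test sh -c 'echo 1 > /sys/class/net/veth1/queues/rx-0/rps_cpus'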
* Re: Slow speed of tcp connections in a network namespace
  2012-12-29 19:41         ` Eric Dumazet
@ 2012-12-29 20:08           ` Andrew Vagin
  2012-12-29 20:20             ` Eric Dumazet
  2012-12-29 21:12             ` Eric Dumazet
  0 siblings, 2 replies; 16+ messages in thread

From: Andrew Vagin @ 2012-12-29 20:08 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, vvs, Michał Mirosław

On Sat, Dec 29, 2012 at 11:41:02AM -0800, Eric Dumazet wrote:
> On Sat, 2012-12-29 at 19:58 +0100, Eric Dumazet wrote:
> > On Sat, 2012-12-29 at 09:40 -0800, Eric Dumazet wrote:
> > >
> > > Please post your new tcpdump then ;)
> > >
> > > Also post "netstat -s" from the root and test ns after your wgets.
> >
> > Also try the following bnx2 patch.
> >
> > It should help GRO / TCP coalescing.
> >
> > bnx2 should be the last driver not using skb head_frag.

I don't have access to the host. I'm going to test your patch tomorrow.
Thanks.

>
> And of course, you should make sure all your bnx2 interrupts are handled
> by the same cpu.

All bnx2 interrupts are handled on all cpus. They are handled on the same
cpu if the kernel is booted with bnx2.disable_msi=1.

Is it right that the receive window stays smaller when packets are
reordered? It looks like a bug.

What I mean is that it probably works correctly when packets are in order.
But I think that even when packets are reordered it should work at the
same speed; CPU load and memory consumption may be a bit higher.

>
> Or else, packets might be reordered because of the way dev_forward_skb()
> works:
>
> (CPU X gets a bunch of packets from eth0, forwards them via netif_rx()
> into the local CPU X queue, NAPI ends on eth0.)
>
> CPU Y gets a bunch of packets from eth0, forwards them via netif_rx()
> into the local CPU Y queue.
>
> CPU X and Y process their local queues in parallel -> packets are
> delivered out of order to the TCP stack.
>
> An alternative is to set up RPS on your veth1 device, to force packets
> to be delivered/handled by a given cpu.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: Slow speed of tcp connections in a network namespace
  2012-12-29 20:08           ` Andrew Vagin
@ 2012-12-29 20:20             ` Eric Dumazet
  2012-12-29 21:07               ` Andrew Vagin
  0 siblings, 1 reply; 16+ messages in thread

From: Eric Dumazet @ 2012-12-29 20:20 UTC (permalink / raw)
  To: Andrew Vagin; +Cc: netdev, vvs, Michał Mirosław

On Sun, 2012-12-30 at 00:08 +0400, Andrew Vagin wrote:
> On Sat, Dec 29, 2012 at 11:41:02AM -0800, Eric Dumazet wrote:
> > On Sat, 2012-12-29 at 19:58 +0100, Eric Dumazet wrote:
> > > On Sat, 2012-12-29 at 09:40 -0800, Eric Dumazet wrote:
> > > >
> > > > Please post your new tcpdump then ;)
> > > >
> > > > Also post "netstat -s" from the root and test ns after your wgets.
> > >
> > > Also try the following bnx2 patch.
> > >
> > > It should help GRO / TCP coalescing.
> > >
> > > bnx2 should be the last driver not using skb head_frag.
>
> I don't have access to the host. I'm going to test your patch tomorrow.
> Thanks.
>
> >
> > And of course, you should make sure all your bnx2 interrupts are handled
> > by the same cpu.
> All bnx2 interrupts are handled on all cpus. They are handled on the same
> cpu if the kernel is booted with bnx2.disable_msi=1.
>
> Is it right that the receive window stays smaller when packets are
> reordered? It looks like a bug.
>
> What I mean is that it probably works correctly when packets are in order.
> But I think that even when packets are reordered it should work at the
> same speed; CPU load and memory consumption may be a bit higher.

Without veth, it doesn't really matter that IRQs are spread over multiple
cpus, because packets are handled in NAPI, and only one cpu runs the
eth0 NAPI handler at a time.

But as soon as packets are queued (by netif_rx()) for 'later'
processing, you can see a dramatic performance decrease.

That's why you really should make sure IRQs for your eth0 device
are handled by a single cpu.

It will help to get better performance in most cases.

echo 1 >/proc/irq/*/eth0/../smp_affinity

If it doesn't work, you might try instead:

echo 1 >/proc/irq/default_smp_affinity
<you might need to reload the bnx2 module, or ifdown/ifup eth0>

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: Slow speed of tcp connections in a network namespace
  2012-12-29 20:20             ` Eric Dumazet
@ 2012-12-29 21:07               ` Andrew Vagin
  0 siblings, 0 replies; 16+ messages in thread

From: Andrew Vagin @ 2012-12-29 21:07 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, vvs, Michał Mirosław

On Sat, Dec 29, 2012 at 12:20:07PM -0800, Eric Dumazet wrote:
> On Sun, 2012-12-30 at 00:08 +0400, Andrew Vagin wrote:
> > All bnx2 interrupts are handled on all cpus. They are handled on the same
> > cpu if the kernel is booted with bnx2.disable_msi=1.
> >
> > Is it right that the receive window stays smaller when packets are
> > reordered? It looks like a bug.
> >
> > What I mean is that it probably works correctly when packets are in order.
> > But I think that even when packets are reordered it should work at the
> > same speed; CPU load and memory consumption may be a bit higher.
>
> Without veth, it doesn't really matter that IRQs are spread over multiple
> cpus, because packets are handled in NAPI, and only one cpu runs the
> eth0 NAPI handler at a time.
>
> But as soon as packets are queued (by netif_rx()) for 'later'
> processing, you can see a dramatic performance decrease.
>
> That's why you really should make sure IRQs for your eth0 device
> are handled by a single cpu.
>
> It will help to get better performance in most cases.

I understand this fact, but such a big difference looks strange to me.

Default configuration (with the bug):
# cat /proc/interrupts | grep eth0
 68:  10187  10188  10187  10023  10190  10185  10187  10019  PCI-MSI-edge  eth0

>
> echo 1 >/proc/irq/*/eth0/../smp_affinity

This doesn't help. I also tried echo 0 > /proc/irq/68/smp_affinity_list.
That doesn't help either.

>
> If it doesn't work, you might try instead:
>
> echo 1 >/proc/irq/default_smp_affinity
> <you might need to reload the bnx2 module, or ifdown/ifup eth0>

This helps, and the bug is not reproduced in this case.
# cat /proc/interrupts | grep eth0
 68:  60777      0      0      0      0      0      0      0  PCI-MSI-edge  eth0

Thanks.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: Slow speed of tcp connections in a network namespace
  2012-12-29 20:08           ` Andrew Vagin
  2012-12-29 20:20             ` Eric Dumazet
@ 2012-12-29 21:12             ` Eric Dumazet
  2012-12-29 21:19               ` Andrew Vagin
  1 sibling, 1 reply; 16+ messages in thread

From: Eric Dumazet @ 2012-12-29 21:12 UTC (permalink / raw)
  To: Andrew Vagin; +Cc: netdev, vvs, Michał Mirosław

On Sun, 2012-12-30 at 00:08 +0400, Andrew Vagin wrote:
> Is it right that the receive window stays smaller when packets are
> reordered? It looks like a bug.

Not really a bug.

TCP is very sensitive to packet reordering. I won't elaborate here as
it's a bit off topic.

Try reordering the credits/debits on your bank account; I am pretty sure
you'll lose some money or even get into serious trouble.

Of course, enabling GRO on eth0 would definitely help a bit...

(once/iff veth driver features are fixed to allow GSO packets to be
forwarded without being segmented again)

^ permalink raw reply	[flat|nested] 16+ messages in thread
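The reordering itself is visible in the counters from the netstat
attachments earlier in the thread — a sketch; TCPOFOQueue counts segments
that arrived out of order and had to be queued:

  # netstat -s | grep -iE 'reorder|TCPOFOQueue'
      TCPOFOQueue: 1580

Note that the netns run above shows a higher TCPOFOQueue (1580) than the
host run (1187) despite transferring far less data.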
* Re: Slow speed of tcp connections in a network namespace
  2012-12-29 21:12             ` Eric Dumazet
@ 2012-12-29 21:19               ` Andrew Vagin
  0 siblings, 0 replies; 16+ messages in thread

From: Andrew Vagin @ 2012-12-29 21:19 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, vvs, Michał Mirosław

On Sat, Dec 29, 2012 at 01:12:26PM -0800, Eric Dumazet wrote:
> On Sun, 2012-12-30 at 00:08 +0400, Andrew Vagin wrote:
> > Is it right that the receive window stays smaller when packets are
> > reordered? It looks like a bug.
>
> Not really a bug.
>
> TCP is very sensitive to packet reordering. I won't elaborate here as
> it's a bit off topic.
>
> Try reordering the credits/debits on your bank account; I am pretty sure
> you'll lose some money or even get into serious trouble.
>
> Of course, enabling GRO on eth0 would definitely help a bit...
>
> (once/iff veth driver features are fixed to allow GSO packets to be
> forwarded without being segmented again)

Eric, thank you for the help. I need some time to think it over. I will
ask if new questions come up.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: Slow speed of tcp connections in a network namespace
  2012-12-29 18:58       ` Eric Dumazet
  2012-12-29 19:41         ` Eric Dumazet
@ 2012-12-29 21:15         ` Andrew Vagin
  1 sibling, 0 replies; 16+ messages in thread

From: Andrew Vagin @ 2012-12-29 21:15 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, vvs, Michał Mirosław

On Sat, Dec 29, 2012 at 07:58:36PM +0100, Eric Dumazet wrote:
> On Sat, 2012-12-29 at 09:40 -0800, Eric Dumazet wrote:
> >
> > Please post your new tcpdump then ;)
> >
> > Also post "netstat -s" from the root and test ns after your wgets.
>
> Also try the following bnx2 patch.
>
> It should help GRO / TCP coalescing.
>
> bnx2 should be the last driver not using skb head_frag.

This patch breaks nothing, but I don't know what kind of benefit I should
get from it :).

FYI: I forgot to say that I disabled gro before collecting the tcpdumps,
because that way the dumps from veth and from eth0 are easier to compare.

> diff --git a/drivers/net/ethernet/broadcom/bnx2.c b/drivers/net/ethernet/broadcom/bnx2.c
> index a1adfaf..08a2d40 100644
> --- a/drivers/net/ethernet/broadcom/bnx2.c
> +++ b/drivers/net/ethernet/broadcom/bnx2.c
> @@ -2726,6 +2726,14 @@ bnx2_free_rx_page(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index)
>  	rx_pg->page = NULL;
>  }
>  
> +static void bnx2_frag_free(const struct bnx2 *bp, void *data)
> +{
> +	if (bp->rx_frag_size)
> +		put_page(virt_to_head_page(data));
> +	else
> +		kfree(data);
> +}
> +
>  static inline int
>  bnx2_alloc_rx_data(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index, gfp_t gfp)
>  {
> @@ -2735,7 +2743,10 @@ bnx2_alloc_rx_data(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index, gfp_t gfp)
>  	struct bnx2_rx_bd *rxbd =
>  		&rxr->rx_desc_ring[BNX2_RX_RING(index)][BNX2_RX_IDX(index)];
>  
> -	data = kmalloc(bp->rx_buf_size, gfp);
> +	if (bp->rx_frag_size)
> +		data = netdev_alloc_frag(bp->rx_frag_size);
> +	else
> +		data = kmalloc(bp->rx_buf_size, gfp);
>  	if (!data)
>  		return -ENOMEM;
>  
> @@ -2744,7 +2755,7 @@ bnx2_alloc_rx_data(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index, gfp_t gfp)
>  			bp->rx_buf_use_size,
>  			PCI_DMA_FROMDEVICE);
>  	if (dma_mapping_error(&bp->pdev->dev, mapping)) {
> -		kfree(data);
> +		bnx2_frag_free(bp, data);
>  		return -EIO;
>  	}
>  
> @@ -3014,9 +3025,9 @@ error:
>  
>  			dma_unmap_single(&bp->pdev->dev, dma_addr, bp->rx_buf_use_size,
>  					 PCI_DMA_FROMDEVICE);
> -			skb = build_skb(data, 0);
> +			skb = build_skb(data, bp->rx_frag_size);
>  			if (!skb) {
> -				kfree(data);
> +				bnx2_frag_free(bp, data);
>  				goto error;
>  			}
>  			skb_reserve(skb, ((u8 *)get_l2_fhdr(data) - data) + BNX2_RX_OFFSET);
> @@ -5358,6 +5369,10 @@ bnx2_set_rx_ring_size(struct bnx2 *bp, u32 size)
>  	/* hw alignment + build_skb() overhead*/
>  	bp->rx_buf_size = SKB_DATA_ALIGN(bp->rx_buf_use_size + BNX2_RX_ALIGN) +
>  		NET_SKB_PAD + SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> +	if (bp->rx_buf_size <= PAGE_SIZE)
> +		bp->rx_frag_size = bp->rx_buf_size;
> +	else
> +		bp->rx_frag_size = 0;
>  	bp->rx_jumbo_thresh = rx_size - BNX2_RX_OFFSET;
>  	bp->rx_ring_size = size;
>  	bp->rx_max_ring = bnx2_find_max_ring(size, BNX2_MAX_RX_RINGS);
> @@ -5436,7 +5451,7 @@ bnx2_free_rx_skbs(struct bnx2 *bp)
>  
>  		rx_buf->data = NULL;
>  
> -		kfree(data);
> +		bnx2_frag_free(bp, data);
>  	}
>  	for (j = 0; j < bp->rx_max_pg_ring_idx; j++)
>  		bnx2_free_rx_page(bp, rxr, j);
> diff --git a/drivers/net/ethernet/broadcom/bnx2.h b/drivers/net/ethernet/broadcom/bnx2.h
> index 172efbe..11f5dee 100644
> --- a/drivers/net/ethernet/broadcom/bnx2.h
> +++ b/drivers/net/ethernet/broadcom/bnx2.h
> @@ -6804,6 +6804,7 @@ struct bnx2 {
>  
>  	u32			rx_buf_use_size;	/* useable size */
>  	u32			rx_buf_size;		/* with alignment */
> +	u32			rx_frag_size;		/* 0 if kmalloced(), or rx_buf_size */
>  	u32			rx_copy_thresh;
>  	u32			rx_jumbo_thresh;
>  	u32			rx_max_ring_idx;

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: Slow speed of tcp connections in a network namespace
  2012-12-29 13:53 ` Eric Dumazet
  2012-12-29 14:50   ` Andrew Vagin
@ 2012-12-29 16:01   ` Michał Mirosław
  2012-12-30  2:26     ` [PATCH] veth: extend device features Eric Dumazet
  1 sibling, 1 reply; 16+ messages in thread

From: Michał Mirosław @ 2012-12-29 16:01 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Andrew Vagin, netdev, vvs, Michał Mirosław

2012/12/29 Eric Dumazet <eric.dumazet@gmail.com>:
> On Sat, 2012-12-29 at 13:24 +0400, Andrew Vagin wrote:
>> We found a few nodes where the network is slow in containers.
>>
>> To test the speed of TCP connections we use wget, which downloads ISO
>> images from the internet.
>>
>> wget in the new netns reports only 1.5 MB/s, but wget in the root netns
>> reports 33 MB/s.
>>
>> A few facts:
>> * Experiments show that the window size for CT traffic does not increase
>>   beyond ~900, while for host traffic the window size increases up to ~14000
>> * packets are sometimes reordered in the netns.
>> * changing tso/gro/gso on the interfaces does not help
>> * the issue was _NOT_ reproduced if the kernel was booted with maxcpus=1
>>   or bnx2.disable_msi=1
>>
>> I reduced the steps to reproduce:
>> * Create a new network namespace "test" and a veth pair.
>> # ip netns add test
>> # ip link add name veth0 type veth peer name veth1
>>
>> * Move veth1 into the netns test
>> # ip link set veth1 netns test
>>
>> * Set an IP address on veth1 and add proper routing rules for this IP
>>   in the root netns.
>> # ip link set up dev veth0
>> # ip netns exec test ip link set up dev veth1
>> # ip netns exec test ip a add REMOTE dev veth1
>> # ip netns exec test ip r a default dev veth1
>> # ip r a REMOTE/32 dev veth0
>>
>> Tcpdumps for both cases are attached to this message.
>> tcpdump.host - wget in the root netns
>> tcpdump.netns.host - tcpdump for the host device, wget in the new netns
>> tcpdump.netns.veth - tcpdump for the veth1 device, wget in the new netns
>>
>> 3.8-rc1 is used for the experiments.
>>
>> Do you have any ideas where the problem is?
>
> veth has absolutely no offload features.
>
> It needs some care...
>
> At the very minimum, let TCP coalescing do its job by allowing SG.
>
> CC Michał Mirosław <mirq-linux@rere.qmqm.pl> for insights.

veth is just like a tunnel device. In terms of offloads, it can do
anything we have software fallbacks for (in case packets get forwarded
to real hardware).

> Please try the following patch:
>
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 95814d9..9fefeb3 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -259,6 +259,10 @@ static const struct net_device_ops veth_netdev_ops = {
>  	.ndo_set_mac_address = eth_mac_addr,
>  };
>
> +#define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_TSO | \
> +		       NETIF_F_HW_CSUM | NETIF_F_HIGHDMA | \
> +		       NETIF_F_HW_VLAN_TX | NETIF_F_HW_VLAN_RX)
> +
>  static void veth_setup(struct net_device *dev)
>  {
>  	ether_setup(dev);
> @@ -269,9 +273,10 @@ static void veth_setup(struct net_device *dev)
>  	dev->netdev_ops = &veth_netdev_ops;
>  	dev->ethtool_ops = &veth_ethtool_ops;
>  	dev->features |= NETIF_F_LLTX;
> +	dev->features |= VETH_FEATURES;
>  	dev->destructor = veth_dev_free;
>
> -	dev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG | NETIF_F_RXCSUM;
> +	dev->hw_features = VETH_FEATURES;
>  }

You missed NETIF_F_RXCSUM in VETH_FEATURES. We might also support
NETIF_F_ALL_TSO, not just the IPv4 version.

Best Regards,
Michał Mirosław

^ permalink raw reply	[flat|nested] 16+ messages in thread
* [PATCH] veth: extend device features
  2012-12-29 16:01   ` Michał Mirosław
@ 2012-12-30  2:26     ` Eric Dumazet
  2012-12-30 10:32       ` David Miller
  0 siblings, 1 reply; 16+ messages in thread

From: Eric Dumazet @ 2012-12-30  2:26 UTC (permalink / raw)
  To: Michał Mirosław, David Miller
  Cc: Andrew Vagin, netdev, vvs, Michał Mirosław

From: Eric Dumazet <edumazet@google.com>

veth is lacking most modern facilities, like SG, checksums, TSO.

It makes sense to extend dev->features to get them; otherwise GRO
aggregation is defeated by a forced segmentation.

Reported-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Michał Mirosław <mirq-linux@rere.qmqm.pl>
---
 drivers/net/veth.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 95814d9..ccf211f 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -259,6 +259,10 @@ static const struct net_device_ops veth_netdev_ops = {
 	.ndo_set_mac_address = eth_mac_addr,
 };
 
+#define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_ALL_TSO | \
+		       NETIF_F_HW_CSUM | NETIF_F_RXCSUM | NETIF_F_HIGHDMA | \
+		       NETIF_F_HW_VLAN_TX | NETIF_F_HW_VLAN_RX)
+
 static void veth_setup(struct net_device *dev)
 {
 	ether_setup(dev);
@@ -269,9 +273,10 @@ static void veth_setup(struct net_device *dev)
 	dev->netdev_ops = &veth_netdev_ops;
 	dev->ethtool_ops = &veth_ethtool_ops;
 	dev->features |= NETIF_F_LLTX;
+	dev->features |= VETH_FEATURES;
 	dev->destructor = veth_dev_free;
 
-	dev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG | NETIF_F_RXCSUM;
+	dev->hw_features = VETH_FEATURES;
 }
 
 /*

^ permalink raw reply related	[flat|nested] 16+ messages in thread
* Re: [PATCH] veth: extend device features
  2012-12-30  2:26     ` [PATCH] veth: extend device features Eric Dumazet
@ 2012-12-30 10:32       ` David Miller
  0 siblings, 0 replies; 16+ messages in thread

From: David Miller @ 2012-12-30 10:32 UTC (permalink / raw)
  To: erdnetdev; +Cc: mirqus, avagin, netdev, vvs, mirq-linux

From: Eric Dumazet <erdnetdev@gmail.com>
Date: Sat, 29 Dec 2012 18:26:10 -0800

> From: Eric Dumazet <edumazet@google.com>
>
> veth is lacking most modern facilities, like SG, checksums, TSO.
>
> It makes sense to extend dev->features to get them; otherwise GRO
> aggregation is defeated by a forced segmentation.
>
> Reported-by: Andrew Vagin <avagin@parallels.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Michał Mirosław <mirq-linux@rere.qmqm.pl>

Applied.

^ permalink raw reply	[flat|nested] 16+ messages in thread