* Slow speed of tcp connections in a network namespace
@ 2012-12-29 9:24 Andrew Vagin
2012-12-29 13:53 ` Eric Dumazet
0 siblings, 1 reply; 16+ messages in thread
From: Andrew Vagin @ 2012-12-29 9:24 UTC (permalink / raw)
To: netdev; +Cc: vvs
[-- Attachment #1: Type: text/plain, Size: 1419 bytes --]
We found a few nodes where the network is slow in containers.
To test the speed of TCP connections we use wget, which downloads iso
images from the internet.
wget in the new netns reports only 1.5 MB/s, but wget in the root netns
reports 33 MB/s.
A few facts:
* Experiments show that the window size for container (CT) traffic does not
grow beyond ~900, while for host traffic it grows up to ~14000
* packets are sometimes reordered in the netns
* changing tso/gro/gso on the interfaces does not help
* the issue was _NOT_ reproduced when the kernel was booted with maxcpus=1 or bnx2.disable_msi=1
I reduced the steps to reproduce:
* Create a new network namespace "test" and a veth pair.
# ip netns add test
# ip link add name veth0 type veth peer name veth1
* Move veth1 into the netns test
# ip link set veth1 netns test
* Set an IP address on veth1 and add the proper routing rules for this IP
in the root netns.
# ip link set up dev veth0; ip netns exec test ip link set up dev veth1
# ip netns exec test ip a add REMOTE dev veth1
# ip netns exec test ip r a default dev veth1
# ip r a REMOTE/32 dev veth0
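With this setup, the speed comparison itself is just two wget runs along these
lines (MIRROR/image.iso is only a placeholder for the iso mirror we download from):
# wget -O /dev/null http://MIRROR/image.iso
# ip netns exec test wget -O /dev/null http://MIRROR/image.iso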
Tcpdumps for both cases are attached to this message.
tcpdump.host - wget in the root netns
tcpdump.netns.host - tcpdump for the host device, wget in the new netns
tcpdump.netns.veth - tcpdump for the veth1 device, wget in the new netns
Kernel 3.8-rc1 is used for the experiments.
Do you have any ideas where the problem is?
[-- Attachment #2: tcpdump.host.gz --]
[-- Type: application/x-gzip, Size: 168126 bytes --]
[-- Attachment #3: tcpdump.netns.veth.gz --]
[-- Type: application/x-gzip, Size: 178809 bytes --]
[-- Attachment #4: tcpdump.netns.host.gz --]
[-- Type: application/x-gzip, Size: 178424 bytes --]
* Re: Slow speed of tcp connections in a network namespace
2012-12-29 9:24 Slow speed of tcp connections in a network namespace Andrew Vagin
@ 2012-12-29 13:53 ` Eric Dumazet
2012-12-29 14:50 ` Andrew Vagin
2012-12-29 16:01 ` Michał Mirosław
0 siblings, 2 replies; 16+ messages in thread
From: Eric Dumazet @ 2012-12-29 13:53 UTC (permalink / raw)
To: Andrew Vagin; +Cc: netdev, vvs, Michał Mirosław
On Sat, 2012-12-29 at 13:24 +0400, Andrew Vagin wrote:
> We found a few nodes, where network works slow in containers.
>
> For testing speed of TCP connections we use wget, which downloads iso
> images from the internet.
>
> wget in the new netns reports only 1.5 MB/s, but wget in the root netns
> reports 33MB/s.
>
> A few facts:
> * Experiments shows that window size for CT traffic does not increases
> up to ~900, however for host traffic window size increases up to ~14000
> * packets are shuffled in the netns sometimes.
> * tso/gro/gso changes on interfaces does not help
> * issue was _NOT_ reproduced if kernel booted with maxcpus=1 or bnx2.disable_msi=1
>
> I reduced steps to reproduce:
> * Create a new network namespace "test" and a veth pair.
> # ip netns add test
> # ip link add name veth0 type veth peer name veth1
>
> * Move veth1 into the netns test
> # ip link set veth1 netns test
>
> * Set ip address on veth1 and proper routing rules are added for this ip
> in the root netns.
> # ip link set up dev veth0; ip link set up dev veth0
> # ip netns exec test ip a add REMOTE dev veth1
> # ip netns exec test ip r a default via veth1
> # ip r a REMOTE/32 via dev veth0
>
> Tcpdump for both cases are attached to this message.
> tcpdump.host - wget in the root netns
> tcpdump.netns.host - tcpdump for the host device, wget in the new netns
> tcpdump.netns.veth - tcpdump for the veth1 device, wget in the new netns
>
> 3.8-rc1 is used for experiments.
>
> Do you have any ideas where is a problem?
veth has absolutely no offload features
It needs some care...
At the very minimum, let TCP coalesce do its job by allowing SG
CC Michał Mirosław <mirq-linux@rere.qmqm.pl> for insights.
Please try following patch :
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 95814d9..9fefeb3 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -259,6 +259,10 @@ static const struct net_device_ops veth_netdev_ops = {
.ndo_set_mac_address = eth_mac_addr,
};
+#define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_TSO | \
+ NETIF_F_HW_CSUM | NETIF_F_HIGHDMA | \
+ NETIF_F_HW_VLAN_TX | NETIF_F_HW_VLAN_RX)
+
static void veth_setup(struct net_device *dev)
{
ether_setup(dev);
@@ -269,9 +273,10 @@ static void veth_setup(struct net_device *dev)
dev->netdev_ops = &veth_netdev_ops;
dev->ethtool_ops = &veth_ethtool_ops;
dev->features |= NETIF_F_LLTX;
+ dev->features |= VETH_FEATURES;
dev->destructor = veth_dev_free;
- dev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG | NETIF_F_RXCSUM;
+ dev->hw_features = VETH_FEATURES;
}
/*
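For reference, whether the new features actually take effect on both ends of
the pair can be checked with ethtool; a sketch, assuming the veth0/veth1 names
and the "test" namespace from the report:
# ethtool -k veth0 | grep -E 'scatter-gather|tcp-segmentation-offload|tx-checksumming'
# ip netns exec test ethtool -k veth1 | grep -E 'scatter-gather|tcp-segmentation-offload|tx-checksumming'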
* Re: Slow speed of tcp connections in a network namespace
2012-12-29 13:53 ` Eric Dumazet
@ 2012-12-29 14:50 ` Andrew Vagin
2012-12-29 17:40 ` Eric Dumazet
2012-12-29 16:01 ` Michał Mirosław
1 sibling, 1 reply; 16+ messages in thread
From: Andrew Vagin @ 2012-12-29 14:50 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev, vvs, Michał Mirosław
On Sat, Dec 29, 2012 at 05:53:23AM -0800, Eric Dumazet wrote:
> > 3.8-rc1 is used for experiments.
> >
> > Do you have any ideas where is a problem?
>
> veth has absolutely no offload features
>
> It needs some care...
>
> At the very minimum, let TCP coalesce do its job by allowing SG
>
> CC Michał Mirosław <mirq-linux@rere.qmqm.pl> for insights.
>
> Please try following patch :
Hello Eric,
Thanks for your feedback.
With this patch the result is a bit better (~4 MB/s), but it's still much less
than in the root netns.
>
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 95814d9..9fefeb3 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -259,6 +259,10 @@ static const struct net_device_ops veth_netdev_ops = {
> .ndo_set_mac_address = eth_mac_addr,
> };
>
> +#define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_TSO | \
> + NETIF_F_HW_CSUM | NETIF_F_HIGHDMA | \
> + NETIF_F_HW_VLAN_TX | NETIF_F_HW_VLAN_RX)
> +
> static void veth_setup(struct net_device *dev)
> {
> ether_setup(dev);
> @@ -269,9 +273,10 @@ static void veth_setup(struct net_device *dev)
> dev->netdev_ops = &veth_netdev_ops;
> dev->ethtool_ops = &veth_ethtool_ops;
> dev->features |= NETIF_F_LLTX;
> + dev->features |= VETH_FEATURES;
> dev->destructor = veth_dev_free;
>
> - dev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG | NETIF_F_RXCSUM;
> + dev->hw_features = VETH_FEATURES;
> }
>
> /*
>
>
* Re: Slow speed of tcp connections in a network namespace
2012-12-29 13:53 ` Eric Dumazet
2012-12-29 14:50 ` Andrew Vagin
@ 2012-12-29 16:01 ` Michał Mirosław
2012-12-30 2:26 ` [PATCH] veth: extend device features Eric Dumazet
1 sibling, 1 reply; 16+ messages in thread
From: Michał Mirosław @ 2012-12-29 16:01 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Andrew Vagin, netdev, vvs, Michał Mirosław
2012/12/29 Eric Dumazet <eric.dumazet@gmail.com>:
> On Sat, 2012-12-29 at 13:24 +0400, Andrew Vagin wrote:
>> We found a few nodes, where network works slow in containers.
>>
>> For testing speed of TCP connections we use wget, which downloads iso
>> images from the internet.
>>
>> wget in the new netns reports only 1.5 MB/s, but wget in the root netns
>> reports 33MB/s.
>>
>> A few facts:
>> * Experiments shows that window size for CT traffic does not increases
>> up to ~900, however for host traffic window size increases up to ~14000
>> * packets are shuffled in the netns sometimes.
>> * tso/gro/gso changes on interfaces does not help
>> * issue was _NOT_ reproduced if kernel booted with maxcpus=1 or bnx2.disable_msi=1
>>
>> I reduced steps to reproduce:
>> * Create a new network namespace "test" and a veth pair.
>> # ip netns add test
>> # ip link add name veth0 type veth peer name veth1
>>
>> * Move veth1 into the netns test
>> # ip link set veth1 netns test
>>
>> * Set ip address on veth1 and proper routing rules are added for this ip
>> in the root netns.
>> # ip link set up dev veth0; ip link set up dev veth0
>> # ip netns exec test ip a add REMOTE dev veth1
>> # ip netns exec test ip r a default via veth1
>> # ip r a REMOTE/32 via dev veth0
>>
>> Tcpdump for both cases are attached to this message.
>> tcpdump.host - wget in the root netns
>> tcpdump.netns.host - tcpdump for the host device, wget in the new netns
>> tcpdump.netns.veth - tcpdump for the veth1 device, wget in the new netns
>>
>> 3.8-rc1 is used for experiments.
>>
>> Do you have any ideas where is a problem?
>
> veth has absolutely no offload features
>
> It needs some care...
>
> At the very minimum, let TCP coalesce do its job by allowing SG
>
> CC Michał Mirosław <mirq-linux@rere.qmqm.pl> for insights.
veth is just like a tunnel device. In terms of offloads, it can do anything
we have software fallbacks for (in case packets get forwarded to real hardware).
> Please try following patch :
>
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 95814d9..9fefeb3 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -259,6 +259,10 @@ static const struct net_device_ops veth_netdev_ops = {
> .ndo_set_mac_address = eth_mac_addr,
> };
>
> +#define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_TSO | \
> + NETIF_F_HW_CSUM | NETIF_F_HIGHDMA | \
> + NETIF_F_HW_VLAN_TX | NETIF_F_HW_VLAN_RX)
> +
> static void veth_setup(struct net_device *dev)
> {
> ether_setup(dev);
> @@ -269,9 +273,10 @@ static void veth_setup(struct net_device *dev)
> dev->netdev_ops = &veth_netdev_ops;
> dev->ethtool_ops = &veth_ethtool_ops;
> dev->features |= NETIF_F_LLTX;
> + dev->features |= VETH_FEATURES;
> dev->destructor = veth_dev_free;
>
> - dev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG | NETIF_F_RXCSUM;
> + dev->hw_features = VETH_FEATURES;
> }
You missed NETIF_F_RXCSUM in VETH_FEATURES. We might support
NETIF_F_ALL_TSO, not just the IPv4 version.
Best Regards,
Michał Mirosław
* Re: Slow speed of tcp connections in a network namespace
2012-12-29 14:50 ` Andrew Vagin
@ 2012-12-29 17:40 ` Eric Dumazet
2012-12-29 18:29 ` Andrew Vagin
2012-12-29 18:58 ` Eric Dumazet
0 siblings, 2 replies; 16+ messages in thread
From: Eric Dumazet @ 2012-12-29 17:40 UTC (permalink / raw)
To: Andrew Vagin; +Cc: netdev, vvs, Michał Mirosław
On Sat, 2012-12-29 at 18:50 +0400, Andrew Vagin wrote:
> On Sat, Dec 29, 2012 at 05:53:23AM -0800, Eric Dumazet wrote:
> > > 3.8-rc1 is used for experiments.
> > >
> > > Do you have any ideas where is a problem?
> >
> > veth has absolutely no offload features
> >
> > It needs some care...
> >
> > At the very minimum, let TCP coalesce do its job by allowing SG
> >
> > CC Michał Mirosław <mirq-linux@rere.qmqm.pl> for insights.
> >
> > Please try following patch :
>
> Hello Eric,
>
> Thanks for your feedback.
>
> With this patch the results is a bit better (~4MB/s), but it's much less
> than in the root netns.
Please post your new tcpdump then ;)
also post "netstat -s" from root and test ns after your wgets
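Roughly something like this after each wget run; a sketch, assuming the "test"
namespace from the reproduction steps:
# netstat -s > netstat.host
# ip netns exec test netstat -s > netstat.netns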
* Re: Slow speed of tcp connections in a network namespace
2012-12-29 17:40 ` Eric Dumazet
@ 2012-12-29 18:29 ` Andrew Vagin
2012-12-29 18:58 ` Eric Dumazet
1 sibling, 0 replies; 16+ messages in thread
From: Andrew Vagin @ 2012-12-29 18:29 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev, vvs, Michał Mirosław
[-- Attachment #1: Type: text/plain, Size: 1014 bytes --]
On Sat, Dec 29, 2012 at 09:40:28AM -0800, Eric Dumazet wrote:
> On Sat, 2012-12-29 at 18:50 +0400, Andrew Vagin wrote:
> > On Sat, Dec 29, 2012 at 05:53:23AM -0800, Eric Dumazet wrote:
> > > > 3.8-rc1 is used for experiments.
> > > >
> > > > Do you have any ideas where is a problem?
> > >
> > > veth has absolutely no offload features
> > >
> > > It needs some care...
> > >
> > > At the very minimum, let TCP coalesce do its job by allowing SG
> > >
> > > CC Michał Mirosław <mirq-linux@rere.qmqm.pl> for insights.
> > >
> > > Please try following patch :
> >
> > Hello Eric,
> >
> > Thanks for your feedback.
> >
> > With this patch the results is a bit better (~4MB/s), but it's much less
> > than in the root netns.
>
> Please post your new tcpdump then ;)
I have rebooted the host and the speed in the netns is again about 1.7 MB/s. I
don't know why it was 4 MB/s the previous time.
The new tcpdumps and netstat output are attached.
>
> also post "netstat -s" from root and test ns after your wgets
>
>
>
[-- Attachment #2: tcpdump.host.gz --]
[-- Type: application/x-gzip, Size: 165716 bytes --]
[-- Attachment #3: tcpdump.netns.host.gz --]
[-- Type: application/x-gzip, Size: 180703 bytes --]
[-- Attachment #4: tcpdump.netns.veth.gz --]
[-- Type: application/x-gzip, Size: 181615 bytes --]
[-- Attachment #5: netstat.host --]
[-- Type: text/plain, Size: 1821 bytes --]
Ip:
    277536 total packets received
    20 forwarded
    0 incoming packets discarded
    202326 incoming packets delivered
    108228 requests sent out
    30 dropped because of missing route
Icmp:
    10 ICMP messages received
    0 input ICMP message failed.
    ICMP input histogram:
        destination unreachable: 5
        echo requests: 5
    6 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        destination unreachable: 1
        echo replies: 5
IcmpMsg:
        InType3: 5
        InType8: 5
        OutType0: 5
        OutType3: 1
Tcp:
    1491 active connections openings
    12 passive connection openings
    14 failed connection attempts
    72 connection resets received
    2 connections established
    201920 segments received
    107815 segments send out
    0 segments retransmited
    0 bad segments received.
    1338 resets sent
Udp:
    387 packets received
    0 packets to unknown port received.
    0 packet receive errors
    389 packets sent
UdpLite:
TcpExt:
    3 invalid SYN cookies received
    4 TCP sockets finished time wait in fast timer
    63 delayed acks sent
    9 delayed acks further delayed because of locked socket
    236 packets directly queued to recvmsg prequeue.
    38600456 packets directly received from backlog
    298101 packets directly received from prequeue
    163501 packets header predicted
    27103 packets header predicted and directly queued to user
    2578 acknowledgments not containing data received
    1018 predicted acknowledgments
    15 connections reset due to unexpected data
    72 connections reset due to early user close
    TCPRcvCoalesce: 123
    TCPOFOQueue: 1187
IpExt:
    InBcastPkts: 9
    OutBcastPkts: 1
    InOctets: 296395504
    OutOctets: 6965311
    InBcastOctets: 2789
    OutBcastOctets: 165
[-- Attachment #6: netstat.netns --]
[-- Type: text/plain, Size: 1463 bytes --]
Ip:
    25483 total packets received
    0 forwarded
    0 incoming packets discarded
    25483 incoming packets delivered
    14572 requests sent out
Icmp:
    4 ICMP messages received
    0 input ICMP message failed.
    ICMP input histogram:
        echo requests: 2
        echo replies: 2
    4 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        echo request: 2
        echo replies: 2
IcmpMsg:
        InType0: 2
        InType8: 2
        OutType0: 2
        OutType8: 2
Tcp:
    1 active connections openings
    0 passive connection openings
    0 failed connection attempts
    0 connection resets received
    0 connections established
    25473 segments received
    14562 segments send out
    0 segments retransmited
    0 bad segments received.
    77 resets sent
Udp:
    6 packets received
    0 packets to unknown port received.
    0 packet receive errors
    6 packets sent
UdpLite:
TcpExt:
    38 delayed acks sent
    Quick ack mode was activated 2 times
    52 packets directly queued to recvmsg prequeue.
    4916752 packets directly received from backlog
    11584 packets directly received from prequeue
    12538 packets header predicted
    2649 packets header predicted and directly queued to user
    1 acknowledgments not containing data received
    2 DSACKs sent for old packets
    1 connections reset due to unexpected data
    TCPOFOQueue: 1580
IpExt:
    InOctets: 38201843
    OutOctets: 829966
* Re: Slow speed of tcp connections in a network namespace
2012-12-29 17:40 ` Eric Dumazet
2012-12-29 18:29 ` Andrew Vagin
@ 2012-12-29 18:58 ` Eric Dumazet
2012-12-29 19:41 ` Eric Dumazet
2012-12-29 21:15 ` Andrew Vagin
1 sibling, 2 replies; 16+ messages in thread
From: Eric Dumazet @ 2012-12-29 18:58 UTC (permalink / raw)
To: Andrew Vagin; +Cc: netdev, vvs, Michał Mirosław
On Saturday, 29 December 2012 at 09:40 -0800, Eric Dumazet wrote:
>
> Please post your new tcpdump then ;)
>
> also post "netstat -s" from root and test ns after your wgets
Also try following bnx2 patch.
It should help GRO / TCP coalesce
bnx2 should be the last driver not using skb head_frag
diff --git a/drivers/net/ethernet/broadcom/bnx2.c b/drivers/net/ethernet/broadcom/bnx2.c
index a1adfaf..08a2d40 100644
--- a/drivers/net/ethernet/broadcom/bnx2.c
+++ b/drivers/net/ethernet/broadcom/bnx2.c
@@ -2726,6 +2726,14 @@ bnx2_free_rx_page(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index)
rx_pg->page = NULL;
}
+static void bnx2_frag_free(const struct bnx2 *bp, void *data)
+{
+ if (bp->rx_frag_size)
+ put_page(virt_to_head_page(data));
+ else
+ kfree(data);
+}
+
static inline int
bnx2_alloc_rx_data(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index, gfp_t gfp)
{
@@ -2735,7 +2743,10 @@ bnx2_alloc_rx_data(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index, gf
struct bnx2_rx_bd *rxbd =
&rxr->rx_desc_ring[BNX2_RX_RING(index)][BNX2_RX_IDX(index)];
- data = kmalloc(bp->rx_buf_size, gfp);
+ if (bp->rx_frag_size)
+ data = netdev_alloc_frag(bp->rx_frag_size);
+ else
+ data = kmalloc(bp->rx_buf_size, gfp);
if (!data)
return -ENOMEM;
@@ -2744,7 +2755,7 @@ bnx2_alloc_rx_data(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index, gf
bp->rx_buf_use_size,
PCI_DMA_FROMDEVICE);
if (dma_mapping_error(&bp->pdev->dev, mapping)) {
- kfree(data);
+ bnx2_frag_free(bp, data);
return -EIO;
}
@@ -3014,9 +3025,9 @@ error:
dma_unmap_single(&bp->pdev->dev, dma_addr, bp->rx_buf_use_size,
PCI_DMA_FROMDEVICE);
- skb = build_skb(data, 0);
+ skb = build_skb(data, bp->rx_frag_size);
if (!skb) {
- kfree(data);
+ bnx2_frag_free(bp, data);
goto error;
}
skb_reserve(skb, ((u8 *)get_l2_fhdr(data) - data) + BNX2_RX_OFFSET);
@@ -5358,6 +5369,10 @@ bnx2_set_rx_ring_size(struct bnx2 *bp, u32 size)
/* hw alignment + build_skb() overhead*/
bp->rx_buf_size = SKB_DATA_ALIGN(bp->rx_buf_use_size + BNX2_RX_ALIGN) +
NET_SKB_PAD + SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+ if (bp->rx_buf_size <= PAGE_SIZE)
+ bp->rx_frag_size = bp->rx_buf_size;
+ else
+ bp->rx_frag_size = 0;
bp->rx_jumbo_thresh = rx_size - BNX2_RX_OFFSET;
bp->rx_ring_size = size;
bp->rx_max_ring = bnx2_find_max_ring(size, BNX2_MAX_RX_RINGS);
@@ -5436,7 +5451,7 @@ bnx2_free_rx_skbs(struct bnx2 *bp)
rx_buf->data = NULL;
- kfree(data);
+ bnx2_frag_free(bp, data);
}
for (j = 0; j < bp->rx_max_pg_ring_idx; j++)
bnx2_free_rx_page(bp, rxr, j);
diff --git a/drivers/net/ethernet/broadcom/bnx2.h b/drivers/net/ethernet/broadcom/bnx2.h
index 172efbe..11f5dee 100644
--- a/drivers/net/ethernet/broadcom/bnx2.h
+++ b/drivers/net/ethernet/broadcom/bnx2.h
@@ -6804,6 +6804,7 @@ struct bnx2 {
u32 rx_buf_use_size; /* useable size */
u32 rx_buf_size; /* with alignment */
+ u32 rx_frag_size; /* 0 if kmalloced(), or rx_buf_size */
u32 rx_copy_thresh;
u32 rx_jumbo_thresh;
u32 rx_max_ring_idx;
* Re: Slow speed of tcp connections in a network namespace
2012-12-29 18:58 ` Eric Dumazet
@ 2012-12-29 19:41 ` Eric Dumazet
2012-12-29 20:08 ` Andrew Vagin
2012-12-29 21:15 ` Andrew Vagin
1 sibling, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2012-12-29 19:41 UTC (permalink / raw)
To: Andrew Vagin; +Cc: netdev, vvs, Michał Mirosław
On Sat, 2012-12-29 at 19:58 +0100, Eric Dumazet wrote:
> On Saturday, 29 December 2012 at 09:40 -0800, Eric Dumazet wrote:
>
> >
> > Please post your new tcpdump then ;)
> >
> > also post "netstat -s" from root and test ns after your wgets
>
> Also try following bnx2 patch.
>
> It should help GRO / TCP coalesce
>
> bnx2 should be the last driver not using skb head_frag
And of course, you should make sure all your bnx2 interrupts are handled
by the same cpu.
Or else, packets might be reordered because of the way dev_forward_skb()
works:
(CPU X gets a bunch of packets from eth0, forwards them via netif_rx() into
the local CPU X queue, then NAPI ends on eth0.)
CPU Y then gets a bunch of packets from eth0 and forwards them via netif_rx()
into the local CPU Y queue.
CPU X and CPU Y process their local queues in parallel -> packets are
delivered out of order to the TCP stack.
An alternative is to set up RPS on your veth1 device, to force packets to be
delivered/handled by a given cpu.
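A minimal RPS setup would look roughly like this (a sketch, assuming a single
RX queue on veth1 and steering everything to CPU 0):
# ip netns exec test sh -c 'echo 1 > /sys/class/net/veth1/queues/rx-0/rps_cpus'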
* Re: Slow speed of tcp connections in a network namespace
2012-12-29 19:41 ` Eric Dumazet
@ 2012-12-29 20:08 ` Andrew Vagin
2012-12-29 20:20 ` Eric Dumazet
2012-12-29 21:12 ` Eric Dumazet
0 siblings, 2 replies; 16+ messages in thread
From: Andrew Vagin @ 2012-12-29 20:08 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev, vvs, Michał Mirosław
On Sat, Dec 29, 2012 at 11:41:02AM -0800, Eric Dumazet wrote:
> On Sat, 2012-12-29 at 19:58 +0100, Eric Dumazet wrote:
> > On Saturday, 29 December 2012 at 09:40 -0800, Eric Dumazet wrote:
> >
> > >
> > > Please post your new tcpdump then ;)
> > >
> > > also post "netstat -s" from root and test ns after your wgets
> >
> > Also try following bnx2 patch.
> >
> > It should help GRO / TCP coalesce
> >
> > bnx2 should be the last driver not using skb head_frag
I don't have access to the host. I'm going to test your patch tomorrow.
Thanks.
>
> And of course, you should make sure all your bnx2 interrupts are handled
> by the same cpu.
All bnx2 interrupts are handled on all cpus. They are handled on the same
cpu only if the kernel is booted with bnx2.disable_msi=1.
Is it right that the receive window will be smaller if packets are reordered?
That looks like a bug.
What I mean is that it probably works correctly when packets arrive in order,
but even when they are reordered it should work at the same speed; only CPU
load and memory consumption may be a bit higher.
>
> Or else, packets might be reordered because the way dev_forward_skb()
> works.
>
> (CPU X gets a bunch of packets from eth0, forward them via netif_rx() in
> the local CPU X queue, NAPI is ended on eth0)
>
> CPU Y gets a bunch of packets from eth0, forward them via netif_rx() in
> the local CPU Y queue.
>
> CPU X and Y process their local queue in // -> packets are delivered Out
> of order to TCP stack
>
> Alternative is to setup RPS on your veth1 device, to force packets being
> delivered/handled by a given cpu
>
>
>
>
>
* Re: Slow speed of tcp connections in a network namespace
2012-12-29 20:08 ` Andrew Vagin
@ 2012-12-29 20:20 ` Eric Dumazet
2012-12-29 21:07 ` Andrew Vagin
2012-12-29 21:12 ` Eric Dumazet
1 sibling, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2012-12-29 20:20 UTC (permalink / raw)
To: Andrew Vagin; +Cc: netdev, vvs, Michał Mirosław
On Sun, 2012-12-30 at 00:08 +0400, Andrew Vagin wrote:
> On Sat, Dec 29, 2012 at 11:41:02AM -0800, Eric Dumazet wrote:
> > On Sat, 2012-12-29 at 19:58 +0100, Eric Dumazet wrote:
> > > On Saturday, 29 December 2012 at 09:40 -0800, Eric Dumazet wrote:
> > >
> > > >
> > > > Please post your new tcpdump then ;)
> > > >
> > > > also post "netstat -s" from root and test ns after your wgets
> > >
> > > Also try following bnx2 patch.
> > >
> > > It should help GRO / TCP coalesce
> > >
> > > bnx2 should be the last driver not using skb head_frag
>
> I don't have access to the host. I'm going to test your patch tomorrow.
> Thanks.
>
> >
> > And of course, you should make sure all your bnx2 interrupts are handled
> > by the same cpu.
> All bnx interrupts are handled on all cpus. They are handled on the same
> cpu, if a kernel is booted with msi_disable=1.
>
> Is it right, that a received window will be less, if packets are not sorted?
> Looks like a bug.
>
> I want to say, that probably it works correctly, if packets are sorted.
> But I think if packets are not sorted, it should work with the same
> speed, cpu load and memory consumption may be a bit more.
Without veth, it doesn't really matter that IRQs are spread over multiple
cpus, because packets are handled in NAPI, and only one cpu runs the
eth0 NAPI handler at a time.
But as soon as packets are queued (by netif_rx()) for 'later'
processing, you can see a dramatic performance decrease.
That's why you really should make sure the IRQs of your eth0 device
are handled by a single cpu.
It will help to get better performance in most cases.
echo 1 >/proc/irq/*/eth0/../smp_affinity
If that doesn't work, you might try instead:
echo 1 >/proc/irq/default_smp_affinity
<you might need to reload the bnx2 module, or ifdown/ifup eth0>
* Re: Slow speed of tcp connections in a network namespace
2012-12-29 20:20 ` Eric Dumazet
@ 2012-12-29 21:07 ` Andrew Vagin
0 siblings, 0 replies; 16+ messages in thread
From: Andrew Vagin @ 2012-12-29 21:07 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev, vvs, Michał Mirosław
On Sat, Dec 29, 2012 at 12:20:07PM -0800, Eric Dumazet wrote:
> On Sun, 2012-12-30 at 00:08 +0400, Andrew Vagin wrote:
> > On Sat, Dec 29, 2012 at 11:41:02AM -0800, Eric Dumazet wrote:
> > > On Sat, 2012-12-29 at 19:58 +0100, Eric Dumazet wrote:
> > > > On Saturday, 29 December 2012 at 09:40 -0800, Eric Dumazet wrote:
> > > >
> > > > >
> > > > > Please post your new tcpdump then ;)
> > > > >
> > > > > also post "netstat -s" from root and test ns after your wgets
> > > >
> > > > Also try following bnx2 patch.
> > > >
> > > > It should help GRO / TCP coalesce
> > > >
> > > > bnx2 should be the last driver not using skb head_frag
> >
> > I don't have access to the host. I'm going to test your patch tomorrow.
> > Thanks.
> >
> > >
> > > And of course, you should make sure all your bnx2 interrupts are handled
> > > by the same cpu.
> > All bnx interrupts are handled on all cpus. They are handled on the same
> > cpu, if a kernel is booted with msi_disable=1.
> >
> > Is it right, that a received window will be less, if packets are not sorted?
> > Looks like a bug.
> >
> > I want to say, that probably it works correctly, if packets are sorted.
> > But I think if packets are not sorted, it should work with the same
> > speed, cpu load and memory consumption may be a bit more.
>
> Without veth, it doesnt really matter that IRQ are spread on multiple
> cpus, because packets are handled in NAPI, and only one cpu runs the
> eth0 NAPI handler at one time.
>
> But as soon as packets are queued (by netif_rx()) for 'later'
> processing, you can have dramatic performance decrease.
>
> Thats why you really should make sure IRQ on your eth0 device
> are handled by a single cpu.
>
> It will help to get better performance in most cases.
I understand this, but such a big difference looks strange to me.
Default configuration (with the bug):
# cat /proc/interrupts | grep eth0
68: 10187 10188 10187 10023 10190 10185
10187 10019 PCI-MSI-edge eth0
>
> echo 1 >/proc/irq/*/eth0/../smp_affinity
This doesn't help.
I tried echo 0 > /proc/irq/68/smp_affinity_list. That doesn't help either.
>
> If it doesnt work, you might try instead :
>
> echo 1 >/proc/irq/default_smp_affinity
> <you might need to reload bnx2 module, or ifdown/ifup eth0 >
This helps, and the bug is not reproduced in this case.
# cat /proc/interrupts | grep eth0
68: 60777 0 0 0 0 0
0 0 PCI-MSI-edge eth0
Thanks.
* Re: Slow speed of tcp connections in a network namespace
2012-12-29 20:08 ` Andrew Vagin
2012-12-29 20:20 ` Eric Dumazet
@ 2012-12-29 21:12 ` Eric Dumazet
2012-12-29 21:19 ` Andrew Vagin
1 sibling, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2012-12-29 21:12 UTC (permalink / raw)
To: Andrew Vagin; +Cc: netdev, vvs, Michał Mirosław
On Sun, 2012-12-30 at 00:08 +0400, Andrew Vagin wrote:
> Is it right, that a received window will be less, if packets are not sorted?
> Looks like a bug.
Not really a bug.
TCP is very sensitive to packet reordering. I won't elaborate here as
it's a bit off topic.
Try reordering the credits/debits on your bank account; I am pretty sure
you'll lose some money or even run into serious trouble.
Of course, enabling GRO on eth0 would definitely help a bit...
(once/if the veth driver features are fixed to allow GSO packets to be
forwarded without being segmented again)
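For reference, GRO on the physical NIC is toggled and checked with ethtool,
roughly (a sketch):
# ethtool -K eth0 gro on
# ethtool -k eth0 | grep generic-receive-offload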
* Re: Slow speed of tcp connections in a network namespace
2012-12-29 18:58 ` Eric Dumazet
2012-12-29 19:41 ` Eric Dumazet
@ 2012-12-29 21:15 ` Andrew Vagin
1 sibling, 0 replies; 16+ messages in thread
From: Andrew Vagin @ 2012-12-29 21:15 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev, vvs, Michał Mirosław
On Sat, Dec 29, 2012 at 07:58:36PM +0100, Eric Dumazet wrote:
> On Saturday, 29 December 2012 at 09:40 -0800, Eric Dumazet wrote:
>
> >
> > Please post your new tcpdump then ;)
> >
> > also post "netstat -s" from root and test ns after your wgets
>
> Also try following bnx2 patch.
>
> It should help GRO / TCP coalesce
>
> bnx2 should be the last driver not using skb head_frag
>
This patch breaks nothing, but I don't know what kind of benefit I should
expect from it :).
FYI:
I forgot to say that I disable GRO before collecting the tcpdumps, because
that makes the dumps from veth and from eth0 easier to compare.
>
> diff --git a/drivers/net/ethernet/broadcom/bnx2.c b/drivers/net/ethernet/broadcom/bnx2.c
> index a1adfaf..08a2d40 100644
> --- a/drivers/net/ethernet/broadcom/bnx2.c
> +++ b/drivers/net/ethernet/broadcom/bnx2.c
> @@ -2726,6 +2726,14 @@ bnx2_free_rx_page(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index)
> rx_pg->page = NULL;
> }
>
> +static void bnx2_frag_free(const struct bnx2 *bp, void *data)
> +{
> + if (bp->rx_frag_size)
> + put_page(virt_to_head_page(data));
> + else
> + kfree(data);
> +}
> +
> static inline int
> bnx2_alloc_rx_data(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index, gfp_t gfp)
> {
> @@ -2735,7 +2743,10 @@ bnx2_alloc_rx_data(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index, gf
> struct bnx2_rx_bd *rxbd =
> &rxr->rx_desc_ring[BNX2_RX_RING(index)][BNX2_RX_IDX(index)];
>
> - data = kmalloc(bp->rx_buf_size, gfp);
> + if (bp->rx_frag_size)
> + data = netdev_alloc_frag(bp->rx_frag_size);
> + else
> + data = kmalloc(bp->rx_buf_size, gfp);
> if (!data)
> return -ENOMEM;
>
> @@ -2744,7 +2755,7 @@ bnx2_alloc_rx_data(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index, gf
> bp->rx_buf_use_size,
> PCI_DMA_FROMDEVICE);
> if (dma_mapping_error(&bp->pdev->dev, mapping)) {
> - kfree(data);
> + bnx2_frag_free(bp, data);
> return -EIO;
> }
>
> @@ -3014,9 +3025,9 @@ error:
>
> dma_unmap_single(&bp->pdev->dev, dma_addr, bp->rx_buf_use_size,
> PCI_DMA_FROMDEVICE);
> - skb = build_skb(data, 0);
> + skb = build_skb(data, bp->rx_frag_size);
> if (!skb) {
> - kfree(data);
> + bnx2_frag_free(bp, data);
> goto error;
> }
> skb_reserve(skb, ((u8 *)get_l2_fhdr(data) - data) + BNX2_RX_OFFSET);
> @@ -5358,6 +5369,10 @@ bnx2_set_rx_ring_size(struct bnx2 *bp, u32 size)
> /* hw alignment + build_skb() overhead*/
> bp->rx_buf_size = SKB_DATA_ALIGN(bp->rx_buf_use_size + BNX2_RX_ALIGN) +
> NET_SKB_PAD + SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> + if (bp->rx_buf_size <= PAGE_SIZE)
> + bp->rx_frag_size = bp->rx_buf_size;
> + else
> + bp->rx_frag_size = 0;
> bp->rx_jumbo_thresh = rx_size - BNX2_RX_OFFSET;
> bp->rx_ring_size = size;
> bp->rx_max_ring = bnx2_find_max_ring(size, BNX2_MAX_RX_RINGS);
> @@ -5436,7 +5451,7 @@ bnx2_free_rx_skbs(struct bnx2 *bp)
>
> rx_buf->data = NULL;
>
> - kfree(data);
> + bnx2_frag_free(bp, data);
> }
> for (j = 0; j < bp->rx_max_pg_ring_idx; j++)
> bnx2_free_rx_page(bp, rxr, j);
> diff --git a/drivers/net/ethernet/broadcom/bnx2.h b/drivers/net/ethernet/broadcom/bnx2.h
> index 172efbe..11f5dee 100644
> --- a/drivers/net/ethernet/broadcom/bnx2.h
> +++ b/drivers/net/ethernet/broadcom/bnx2.h
> @@ -6804,6 +6804,7 @@ struct bnx2 {
>
> u32 rx_buf_use_size; /* useable size */
> u32 rx_buf_size; /* with alignment */
> + u32 rx_frag_size; /* 0 if kmalloced(), or rx_buf_size */
> u32 rx_copy_thresh;
> u32 rx_jumbo_thresh;
> u32 rx_max_ring_idx;
>
>
* Re: Slow speed of tcp connections in a network namespace
2012-12-29 21:12 ` Eric Dumazet
@ 2012-12-29 21:19 ` Andrew Vagin
0 siblings, 0 replies; 16+ messages in thread
From: Andrew Vagin @ 2012-12-29 21:19 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev, vvs, Michał Mirosław
On Sat, Dec 29, 2012 at 01:12:26PM -0800, Eric Dumazet wrote:
> On Sun, 2012-12-30 at 00:08 +0400, Andrew Vagin wrote:
>
> > Is it right, that a received window will be less, if packets are not sorted?
> > Looks like a bug.
>
> Not really a bug.
>
> TCP is very sensitive to packet reorders. I wont elaborate here as
> its a bit off topic.
>
> Try to reorders credits/debits on your bank account, I am pretty sure
> you'll lose some money or even get serious troubles.
>
> Of course, enabling GRO on eth0 would definitely help a bit...
>
> (once/iff veth driver features are fixed to allow GSO packets being
> forwarded without being segmented again)
>
Eric, thank you for the help.
I need some time to think it over. I will come back to you if new questions
appear.
>
>
* [PATCH] veth: extend device features
2012-12-29 16:01 ` Michał Mirosław
@ 2012-12-30 2:26 ` Eric Dumazet
2012-12-30 10:32 ` David Miller
0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2012-12-30 2:26 UTC (permalink / raw)
To: Michał Mirosław, David Miller
Cc: Andrew Vagin, netdev, vvs, Michał Mirosław
From: Eric Dumazet <edumazet@google.com>
veth is lacking most modern facilities, like SG, checksums, TSO.
It makes sense to extend dev->features to get them, or GRO aggregation
is defeated by a forced segmentation.
Reported-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Michał Mirosław <mirq-linux@rere.qmqm.pl>
---
drivers/net/veth.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 95814d9..ccf211f 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -259,6 +259,10 @@ static const struct net_device_ops veth_netdev_ops = {
.ndo_set_mac_address = eth_mac_addr,
};
+#define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_ALL_TSO | \
+ NETIF_F_HW_CSUM | NETIF_F_RXCSUM | NETIF_F_HIGHDMA | \
+ NETIF_F_HW_VLAN_TX | NETIF_F_HW_VLAN_RX)
+
static void veth_setup(struct net_device *dev)
{
ether_setup(dev);
@@ -269,9 +273,10 @@ static void veth_setup(struct net_device *dev)
dev->netdev_ops = &veth_netdev_ops;
dev->ethtool_ops = &veth_ethtool_ops;
dev->features |= NETIF_F_LLTX;
+ dev->features |= VETH_FEATURES;
dev->destructor = veth_dev_free;
- dev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG | NETIF_F_RXCSUM;
+ dev->hw_features = VETH_FEATURES;
}
/*
* Re: [PATCH] veth: extend device features
2012-12-30 2:26 ` [PATCH] veth: extend device features Eric Dumazet
@ 2012-12-30 10:32 ` David Miller
0 siblings, 0 replies; 16+ messages in thread
From: David Miller @ 2012-12-30 10:32 UTC (permalink / raw)
To: erdnetdev; +Cc: mirqus, avagin, netdev, vvs, mirq-linux
From: Eric Dumazet <erdnetdev@gmail.com>
Date: Sat, 29 Dec 2012 18:26:10 -0800
> From: Eric Dumazet <edumazet@google.com>
>
> veth is lacking most modern facilities, like SG, checksums, TSO.
>
> It makes sense to extend dev->features to get them, or GRO aggregation
> is defeated by a forced segmentation.
>
> Reported-by: Andrew Vagin <avagin@parallels.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Applied.
Thread overview: 16+ messages
2012-12-29 9:24 Slow speed of tcp connections in a network namespace Andrew Vagin
2012-12-29 13:53 ` Eric Dumazet
2012-12-29 14:50 ` Andrew Vagin
2012-12-29 17:40 ` Eric Dumazet
2012-12-29 18:29 ` Andrew Vagin
2012-12-29 18:58 ` Eric Dumazet
2012-12-29 19:41 ` Eric Dumazet
2012-12-29 20:08 ` Andrew Vagin
2012-12-29 20:20 ` Eric Dumazet
2012-12-29 21:07 ` Andrew Vagin
2012-12-29 21:12 ` Eric Dumazet
2012-12-29 21:19 ` Andrew Vagin
2012-12-29 21:15 ` Andrew Vagin
2012-12-29 16:01 ` Michał Mirosław
2012-12-30 2:26 ` [PATCH] veth: extend device features Eric Dumazet
2012-12-30 10:32 ` David Miller