* bridging: flow control regression
From: Simon Horman @ 2010-11-01 12:29 UTC
To: netdev; +Cc: Jay Vosburgh, Eric Dumazet, David S. Miller
Hi,
I have observed what appears to be a regression between 2.6.34 and
2.6.35-rc1. The behaviour described below is still present in Linus's
current tree (2.6.36+).
On 2.6.34 and earlier when sending a UDP stream to a bonded interface
the throughput is approximately equal to the available physical bandwidth.
# netperf -c -4 -t UDP_STREAM -H 172.17.50.253 -l 30 -- -m 1472
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
172.17.50.253 (172.17.50.253) port 0 AF_INET
Socket  Message  Elapsed      Messages                    CPU      Service
Size    Size     Time         Okay Errors   Throughput    Util     Demand
bytes   bytes    secs            #      #   10^6bits/sec  % SU     us/KB

114688    1472   30.00       2438265      0        957.1  18.09     3.159
109568           30.00       2389980               938.1  -1.00    -1.000
On 2.6.35-rc1 netperf sends ~7 Gbit/s.
Curiously it only consumes 50% CPU; I would expect this to be CPU bound.
# netperf -c -4 -t UDP_STREAM -H 172.17.50.253 -l 30 -- -m 1472
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
172.17.50.253 (172.17.50.253) port 0 AF_INET
Socket  Message  Elapsed      Messages                    CPU      Service
Size    Size     Time         Okay Errors   Throughput    Util     Demand
bytes   bytes    secs            #      #   10^6bits/sec  % SU     us/KB

116736    1472   30.00      18064360      0       7090.8  50.62     8.665
109568           30.00       2438090               957.0  -1.00    -1.000
In this case the bonding device has a single gigabit slave device
and is running in balance-rr mode. I have observed similar results
with two and three slave devices.
I have bisected the problem and the offending commit appears to be
"net: Introduce skb_orphan_try()". My tired eyes tell me that change
frees skbs earlier than they otherwise would be freed, unless TX timestamping
is in effect. That does seem to make sense in relation to this problem,
though I have yet to dig into specifically why bonding is adversely affected.
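For reference, the helper in question looks roughly like this (a sketch of the
2.6.35-era net/core/dev.c code; the exact TX-timestamp flag test differs
between kernel versions):

static inline void skb_orphan_try(struct sk_buff *skb)
{
        /* Orphaning detaches the skb from its socket and runs the
         * destructor (sock_wfree() for UDP), which immediately returns
         * the charged bytes to the socket send buffer.  Orphaning is
         * skipped when TX timestamping is requested, because the
         * timestamp must later be delivered back to that socket.
         */
        if (!skb_shinfo(skb)->tx_flags)
                skb_orphan(skb);
}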
* Re: bridging: flow control regression
From: Eric Dumazet @ 2010-11-01 12:59 UTC
To: Simon Horman; +Cc: netdev, Jay Vosburgh, David S. Miller
On Monday 1 November 2010 at 21:29 +0900, Simon Horman wrote:
> Hi,
>
> I have observed what appears to be a regression between 2.6.34 and
> 2.6.35-rc1. The behaviour described below is still present in Linus's
> current tree (2.6.36+).
>
> On 2.6.34 and earlier when sending a UDP stream to a bonded interface
> the throughput is approximately equal to the available physical bandwidth.
>
> # netperf -c -4 -t UDP_STREAM -H 172.17.50.253 -l 30 -- -m 1472
> UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 172.17.50.253 (172.17.50.253) port 0 AF_INET
> Socket Message Elapsed Messages CPU Service
> Size Size Time Okay Errors Throughput Util Demand
> bytes bytes secs # # 10^6bits/sec % SU us/KB
>
> 114688 1472 30.00 2438265 0 957.1 18.09 3.159
> 109568 30.00 2389980 938.1 -1.00 -1.000
>
> On 2.6.35-rc1 netperf sends ~7 Gbit/s.
> Curiously it only consumes 50% CPU; I would expect this to be CPU bound.
>
> # netperf -c -4 -t UDP_STREAM -H 172.17.50.253 -l 30 -- -m 1472
> UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 172.17.50.253 (172.17.50.253) port 0 AF_INET
> Socket Message Elapsed Messages CPU Service
> Size Size Time Okay Errors Throughput Util Demand
> bytes bytes secs # # 10^6bits/sec % SU us/KB
>
> 116736 1472 30.00 18064360 0 7090.8 50.62 8.665
> 109568 30.00 2438090 957.0 -1.00 -1.000
>
> In this case the bonding device has a single gigabit slave device
> and is running in balance-rr mode. I have observed similar results
> with two and three slave devices.
>
> I have bisected the problem and the offending commit appears to be
> "net: Introduce skb_orphan_try()". My tired eyes tell me that change
> frees skbs earlier than they otherwise would be freed, unless TX timestamping
> is in effect. That does seem to make sense in relation to this problem,
> though I have yet to dig into specifically why bonding is adversely affected.
>
I assume you meant "bonding: flow control regression", i.e. this is not
related to bridging?
One problem with bonding is that the xmit() method always returns
NETDEV_TX_OK, so a flooder cannot know that some of its frames were lost.
So yes, the patch you mention has the effect of allowing UDP to flood the
bonding device, since we orphan the skb before giving it to the device
(bond or ethX).
With a normal device (one with a qdisc), we queue the skb and orphan it
only when it leaves the queue. With a not-too-big socket send buffer, that
slows down the sender enough to "send UDP frames at line rate only".
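The back-pressure comes from the skb staying charged to the sending socket
until it is orphaned. skb_orphan() is roughly the following (close to the
helper in include/linux/skbuff.h, modulo version details):

static inline void skb_orphan(struct sk_buff *skb)
{
        /* For locally generated packets the destructor is sock_wfree(),
         * which decrements sk->sk_wmem_alloc and wakes a sender blocked
         * on a full send buffer.  Whether this runs before or after the
         * packet sits in a device/qdisc queue decides whether the socket
         * send buffer throttles the sender.
         */
        if (skb->destructor)
                skb->destructor(skb);
        skb->destructor = NULL;
        skb->sk = NULL;
}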
* Re: bonding: flow control regression [was Re: bridging: flow control regression]
From: Simon Horman @ 2010-11-02 2:06 UTC
To: Eric Dumazet; +Cc: netdev, Jay Vosburgh, David S. Miller
On Mon, Nov 01, 2010 at 01:59:32PM +0100, Eric Dumazet wrote:
> On Monday 1 November 2010 at 21:29 +0900, Simon Horman wrote:
> > Hi,
> >
> > I have observed what appears to be a regression between 2.6.34 and
> > 2.6.35-rc1. The behaviour described below is still present in Linus's
> > current tree (2.6.36+).
> >
> > On 2.6.34 and earlier when sending a UDP stream to a bonded interface
> > the throughput is approximately equal to the available physical bandwidth.
> >
> > # netperf -c -4 -t UDP_STREAM -H 172.17.50.253 -l 30 -- -m 1472
> > UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> > 172.17.50.253 (172.17.50.253) port 0 AF_INET
> > Socket Message Elapsed Messages CPU Service
> > Size Size Time Okay Errors Throughput Util Demand
> > bytes bytes secs # # 10^6bits/sec % SU us/KB
> >
> > 114688 1472 30.00 2438265 0 957.1 18.09 3.159
> > 109568 30.00 2389980 938.1 -1.00 -1.000
> >
> > On 2.6.35-rc1 netperf sends ~7 Gbit/s.
> > Curiously it only consumes 50% CPU; I would expect this to be CPU bound.
> >
> > # netperf -c -4 -t UDP_STREAM -H 172.17.50.253 -l 30 -- -m 1472
> > UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> > 172.17.50.253 (172.17.50.253) port 0 AF_INET
> > Socket Message Elapsed Messages CPU Service
> > Size Size Time Okay Errors Throughput Util Demand
> > bytes bytes secs # # 10^6bits/sec % SU us/KB
> >
> > 116736 1472 30.00 18064360 0 7090.8 50.62 8.665
> > 109568 30.00 2438090 957.0 -1.00 -1.000
> >
> > In this case the bonding device has a single gigabit slave device
> > and is running in balance-rr mode. I have observed similar results
> > with two and three slave devices.
> >
> > I have bisected the problem and the offending commit appears to be
> > "net: Introduce skb_orphan_try()". My tired eyes tell me that change
> > frees skbs earlier than they otherwise would be freed, unless TX timestamping
> > is in effect. That does seem to make sense in relation to this problem,
> > though I have yet to dig into specifically why bonding is adversely affected.
> >
>
> I assume you meant "bonding: flow control regression", i.e. this is not
> related to bridging?
Yes, sorry about that. I meant bonding, not bridging.
> One problem with bonding is that the xmit() method always returns
> NETDEV_TX_OK, so a flooder cannot know that some of its frames were lost.
>
> So yes, the patch you mention has the effect of allowing UDP to flood the
> bonding device, since we orphan the skb before giving it to the device
> (bond or ethX).
>
> With a normal device (one with a qdisc), we queue the skb and orphan it
> only when it leaves the queue. With a not-too-big socket send buffer, that
> slows down the sender enough to "send UDP frames at line rate only".
Thanks for the explanation.
I'm not entirely sure how much of a problem this is in practice.
* Re: bonding: flow control regression [was Re: bridging: flow control regression]
From: Eric Dumazet @ 2010-11-02 4:53 UTC
To: Simon Horman; +Cc: netdev, Jay Vosburgh, David S. Miller
On Tuesday 2 November 2010 at 11:06 +0900, Simon Horman wrote:
> Thanks for the explanation.
> I'm not entirely sure how much of a problem this is in practice.
Maybe for virtual devices (tunnels, bonding, ...), it would make sense
to delay the orphaning up to the real device.
But if the socket send buffer is very large, it would defeat the flow
control anyway...
* Re: bonding: flow control regression [was Re: bridging: flow control regression]
From: Simon Horman @ 2010-11-02 7:03 UTC
To: Eric Dumazet; +Cc: netdev, Jay Vosburgh, David S. Miller
On Tue, Nov 02, 2010 at 05:53:42AM +0100, Eric Dumazet wrote:
> On Tuesday 2 November 2010 at 11:06 +0900, Simon Horman wrote:
>
> > Thanks for the explanation.
> > I'm not entirely sure how much of a problem this is in practice.
>
> Maybe for virtual devices (tunnels, bonding, ...), it would make sense
> to delay the orphaning up to the real device.
That was my initial thought. Could you give me some guidance
on how that might be done so I can try and make a patch to test?
> But if the socket send buffer is very large, it would defeat the flow
> control anyway...
I'm primarily concerned about a situation where
UDP packets are sent as fast as possible, indefinitely.
And in that scenario, I think it would need to be a rather large buffer.
* Re: bonding: flow control regression [was Re: bridging: flow control regression]
From: Eric Dumazet @ 2010-11-02 7:30 UTC
To: Simon Horman; +Cc: netdev, Jay Vosburgh, David S. Miller
On Tuesday 2 November 2010 at 16:03 +0900, Simon Horman wrote:
> On Tue, Nov 02, 2010 at 05:53:42AM +0100, Eric Dumazet wrote:
> > On Tuesday 2 November 2010 at 11:06 +0900, Simon Horman wrote:
> >
> > > Thanks for the explanation.
> > > I'm not entirely sure how much of a problem this is in practice.
> >
> > Maybe for virtual devices (tunnels, bonding, ...), it would make sense
> > to delay the orphaning up to the real device.
>
> That was my initial thought. Could you give me some guidance
> on how that might be done so I can try and make a patch to test?
>
> > But if the socket send buffer is very large, it would defeat the flow
> > control anyway...
>
> I'm primarily concerned about a situation where
> UDP packets are sent as fast as possible, indefinitely.
> And in that scenario, I think it would need to be a rather large buffer.
>
Please try the following patch, thanks.
drivers/net/bonding/bond_main.c | 1 +
include/linux/if.h | 3 +++
net/core/dev.c | 5 +++--
3 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index bdb68a6..325931e 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -4714,6 +4714,7 @@ static void bond_setup(struct net_device *bond_dev)
bond_dev->flags |= IFF_MASTER|IFF_MULTICAST;
bond_dev->priv_flags |= IFF_BONDING;
bond_dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
+ bond_dev->priv_flags &= ~IFF_EARLY_ORPHAN;
if (bond->params.arp_interval)
bond_dev->priv_flags |= IFF_MASTER_ARPMON;
diff --git a/include/linux/if.h b/include/linux/if.h
index 1239599..7499a99 100644
--- a/include/linux/if.h
+++ b/include/linux/if.h
@@ -77,6 +77,9 @@
#define IFF_BRIDGE_PORT 0x8000 /* device used as bridge port */
#define IFF_OVS_DATAPATH 0x10000 /* device used as Open vSwitch
* datapath port */
+#define IFF_EARLY_ORPHAN 0x20000 /* early orphan skbs in
+ * dev_hard_start_xmit()
+ */
#define IF_GET_IFACE 0x0001 /* for querying only */
#define IF_GET_PROTO 0x0002
diff --git a/net/core/dev.c b/net/core/dev.c
index 35dfb83..eabf94d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2005,7 +2005,8 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
skb_dst_drop(skb);
- skb_orphan_try(skb);
+ if (dev->priv_flags & IFF_EARLY_ORPHAN)
+ skb_orphan_try(skb);
if (vlan_tx_tag_present(skb) &&
!(dev->features & NETIF_F_HW_VLAN_TX)) {
@@ -5590,7 +5591,7 @@ struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
INIT_LIST_HEAD(&dev->napi_list);
INIT_LIST_HEAD(&dev->unreg_list);
INIT_LIST_HEAD(&dev->link_watch_list);
- dev->priv_flags = IFF_XMIT_DST_RELEASE;
+ dev->priv_flags = IFF_XMIT_DST_RELEASE | IFF_EARLY_ORPHAN ;
setup(dev);
strcpy(dev->name, name);
return dev;
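In short: alloc_netdev_mq() now sets IFF_EARLY_ORPHAN by default, so all
existing devices keep the current behaviour, dev_hard_start_xmit() calls
skb_orphan_try() only when the flag is set, and bond_setup() clears the flag
so skbs stay charged to the sending socket until the slave path orphans them.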
* Re: bonding: flow control regression [was Re: bridging: flow control regression]
From: Simon Horman @ 2010-11-02 8:46 UTC
To: Eric Dumazet; +Cc: netdev, Jay Vosburgh, David S. Miller
On Tue, Nov 02, 2010 at 08:30:57AM +0100, Eric Dumazet wrote:
> On Tuesday 2 November 2010 at 16:03 +0900, Simon Horman wrote:
> > On Tue, Nov 02, 2010 at 05:53:42AM +0100, Eric Dumazet wrote:
> > > On Tuesday 2 November 2010 at 11:06 +0900, Simon Horman wrote:
> > >
> > > > Thanks for the explanation.
> > > > I'm not entirely sure how much of a problem this is in practice.
> > >
> > > Maybe for virtual devices (tunnels, bonding, ...), it would make sense
> > > to delay the orphaning up to the real device.
> >
> > That was my initial thought. Could you give me some guidance
> > on how that might be done so I can try and make a patch to test?
> >
> > > But if the socket send buffer is very large, it would defeat the flow
> > > control anyway...
> >
> > I'm primarily concerned about a situation where
> > UDP packets are sent as fast as possible, indefinitely.
> > And in that scenario, I think it would need to be a rather large buffer.
> >
>
> Please try the following patch, thanks.
Thanks Eric, that seems to resolve the problem that I was seeing.
With your patch I see:
No bonding
# netperf -c -4 -t UDP_STREAM -H 172.17.60.216 -l 30 -- -m 1472
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216 (172.17.60.216) port 0 AF_INET
Socket  Message  Elapsed      Messages                    CPU      Service
Size    Size     Time         Okay Errors   Throughput    Util     Demand
bytes   bytes    secs            #      #   10^6bits/sec  % SU     us/KB

116736    1472   30.00       2438413      0        957.2   8.52     1.458
129024           30.00       2438413               957.2  -1.00    -1.000
With bonding (one slave, the interface used in the test above)
netperf -c -4 -t UDP_STREAM -H 172.17.60.216 -l 30 -- -m 1472
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216 (172.17.60.216) port 0 AF_INET
Socket  Message  Elapsed      Messages                    CPU      Service
Size    Size     Time         Okay Errors   Throughput    Util     Demand
bytes   bytes    secs            #      #   10^6bits/sec  % SU     us/KB

116736    1472   30.00       2438390      0        957.1   8.97     1.535
129024           30.00       2438390               957.1  -1.00    -1.000
* Re: bonding: flow control regression [was Re: bridging: flow control regression]
From: Eric Dumazet @ 2010-11-02 9:29 UTC
To: Simon Horman; +Cc: netdev, Jay Vosburgh, David S. Miller
On Tuesday 2 November 2010 at 17:46 +0900, Simon Horman wrote:
> Thanks Eric, that seems to resolve the problem that I was seeing.
>
> With your patch I see:
>
> No bonding
>
> # netperf -c -4 -t UDP_STREAM -H 172.17.60.216 -l 30 -- -m 1472
> UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216 (172.17.60.216) port 0 AF_INET
> Socket Message Elapsed Messages CPU Service
> Size Size Time Okay Errors Throughput Util Demand
> bytes bytes secs # # 10^6bits/sec % SU us/KB
>
> 116736 1472 30.00 2438413 0 957.2 8.52 1.458
> 129024 30.00 2438413 957.2 -1.00 -1.000
>
> With bonding (one slave, the interface used in the test above)
>
> netperf -c -4 -t UDP_STREAM -H 172.17.60.216 -l 30 -- -m 1472
> UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216 (172.17.60.216) port 0 AF_INET
> Socket Message Elapsed Messages CPU Service
> Size Size Time Okay Errors Throughput Util Demand
> bytes bytes secs # # 10^6bits/sec % SU us/KB
>
> 116736 1472 30.00 2438390 0 957.1 8.97 1.535
> 129024 30.00 2438390 957.1 -1.00 -1.000
>
Sure the patch helps when not too many flows are involved, but this is a
hack.
Say the device queue is 1000 packets and you run a workload with 2000
sockets; it won't work...
Or the device queue is 1000 packets, there is one flow, and the socket send
queue size allows more than 1000 packets to be 'in flight' (echo 2000000
>/proc/sys/net/core/wmem_default); then it won't work with bonding either,
only with setups where a qdisc sits on the first device met after the
socket.
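A rough back-of-envelope, assuming an skb truesize of about 2 KB for a
1472-byte datagram (the exact truesize depends on the driver and allocator):

        2,000,000 bytes / ~2,048 bytes per skb  ~  1,000 packets in flight

which is on the order of a default 1000-packet txqueuelen, so the socket send
buffer stops providing back-pressure even where a qdisc exists.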
* Re: bonding: flow control regression [was Re: bridging: flow control regression]
From: Simon Horman @ 2010-11-06 9:25 UTC
To: Eric Dumazet; +Cc: netdev, Jay Vosburgh, David S. Miller
On Tue, Nov 02, 2010 at 10:29:45AM +0100, Eric Dumazet wrote:
> On Tuesday 2 November 2010 at 17:46 +0900, Simon Horman wrote:
>
> > Thanks Eric, that seems to resolve the problem that I was seeing.
> >
> > With your patch I see:
> >
> > No bonding
> >
> > # netperf -c -4 -t UDP_STREAM -H 172.17.60.216 -l 30 -- -m 1472
> > UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216 (172.17.60.216) port 0 AF_INET
> > Socket Message Elapsed Messages CPU Service
> > Size Size Time Okay Errors Throughput Util Demand
> > bytes bytes secs # # 10^6bits/sec % SU us/KB
> >
> > 116736 1472 30.00 2438413 0 957.2 8.52 1.458
> > 129024 30.00 2438413 957.2 -1.00 -1.000
> >
> > With bonding (one slave, the interface used in the test above)
> >
> > netperf -c -4 -t UDP_STREAM -H 172.17.60.216 -l 30 -- -m 1472
> > UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216 (172.17.60.216) port 0 AF_INET
> > Socket Message Elapsed Messages CPU Service
> > Size Size Time Okay Errors Throughput Util Demand
> > bytes bytes secs # # 10^6bits/sec % SU us/KB
> >
> > 116736 1472 30.00 2438390 0 957.1 8.97 1.535
> > 129024 30.00 2438390 957.1 -1.00 -1.000
> >
>
>
> Sure the patch helps when not too many flows are involved, but this is a
> hack.
>
> Say the device queue is 1000 packets and you run a workload with 2000
> sockets; it won't work...
>
> Or the device queue is 1000 packets, there is one flow, and the socket send
> queue size allows more than 1000 packets to be 'in flight' (echo 2000000
> >/proc/sys/net/core/wmem_default); then it won't work with bonding either,
> only with setups where a qdisc sits on the first device met after the
> socket.
True, thanks for pointing that out.
The scenario that I am actually interested in is virtualisation.
And I believe that your patch helps the vhostnet case (I don't see
flow control problems with bonding + virtio without vhostnet). However,
I am unsure whether there are also easy ways to degrade
flow control in the vhostnet case.
* Re: bonding: flow control regression [was Re: bridging: flow control regression]
From: Simon Horman @ 2010-12-08 13:22 UTC
To: Eric Dumazet; +Cc: netdev, Jay Vosburgh, David S. Miller
On Sat, Nov 06, 2010 at 06:25:37PM +0900, Simon Horman wrote:
> On Tue, Nov 02, 2010 at 10:29:45AM +0100, Eric Dumazet wrote:
> > On Tuesday 2 November 2010 at 17:46 +0900, Simon Horman wrote:
> >
> > > Thanks Eric, that seems to resolve the problem that I was seeing.
> > >
> > > With your patch I see:
> > >
> > > No bonding
> > >
> > > # netperf -c -4 -t UDP_STREAM -H 172.17.60.216 -l 30 -- -m 1472
> > > UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216 (172.17.60.216) port 0 AF_INET
> > > Socket Message Elapsed Messages CPU Service
> > > Size Size Time Okay Errors Throughput Util Demand
> > > bytes bytes secs # # 10^6bits/sec % SU us/KB
> > >
> > > 116736 1472 30.00 2438413 0 957.2 8.52 1.458
> > > 129024 30.00 2438413 957.2 -1.00 -1.000
> > >
> > > With bonding (one slave, the interface used in the test above)
> > >
> > > netperf -c -4 -t UDP_STREAM -H 172.17.60.216 -l 30 -- -m 1472
> > > UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216 (172.17.60.216) port 0 AF_INET
> > > Socket Message Elapsed Messages CPU Service
> > > Size Size Time Okay Errors Throughput Util Demand
> > > bytes bytes secs # # 10^6bits/sec % SU us/KB
> > >
> > > 116736 1472 30.00 2438390 0 957.1 8.97 1.535
> > > 129024 30.00 2438390 957.1 -1.00 -1.000
> > >
> >
> >
> > Sure the patch helps when not too many flows are involved, but this is a
> > hack.
> >
> > Say the device queue is 1000 packets and you run a workload with 2000
> > sockets; it won't work...
> >
> > Or the device queue is 1000 packets, there is one flow, and the socket send
> > queue size allows more than 1000 packets to be 'in flight' (echo 2000000
> > >/proc/sys/net/core/wmem_default); then it won't work with bonding either,
> > only with setups where a qdisc sits on the first device met after the
> > socket.
>
> True, thanks for pointing that out.
>
> The scenario that I am actually interested in is virtualisation.
> And I believe that your patch helps the vhostnet case (I don't see
> flow control problems with bonding + virtio without vhostnet). However,
> I am unsure whether there are also easy ways to degrade
> flow control in the vhostnet case.
Hi Eric,
do you have any thoughts on this?
I measured the performance impact of your patch on 2.6.37-rc1
and I can see why early orphaning is a win.
The tests are run over a bond with 3 slaves.
The bond is in balance-rr mode. Other parameters of interest are:
MTU=1500
client,server:tcp_reordering=3(default)
client:GSO=off,
client:TSO=off
server:GRO=off
server:rx-usecs=3(default)
Without your no early-orphan patch
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
172.17.60.216 (172.17.60.216) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384   16384   10.00      1621.03   16.31    6.48     1.648   2.621
With your no early-orphan patch
# netperf -C -c -4 -t TCP_STREAM -H 172.17.60.216
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
172.17.60.216 (172.17.60.216) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384   16384   10.00      1433.48    9.60    5.45     1.098   2.490
However, in the case of virtualisation I think it is a win to be able to do
flow control on UDP traffic from guests (using virtio). Am I missing
something, and can flow control be bypassed anyway? If not, perhaps making
the change that your patch makes configurable through proc or ethtool is an
option?
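As an illustration of the ethtool route (a hypothetical sketch only, reusing
the IFF_EARLY_ORPHAN bit from your patch; the handler names are made up, and
they would be hooked into bond_ethtool_ops so that something like
"ethtool --set-priv-flags bond0 early-orphan off" toggles the behaviour):

#include <linux/errno.h>
#include <linux/ethtool.h>
#include <linux/kernel.h>
#include <linux/netdevice.h>
#include <linux/string.h>

/* Hypothetical: expose early orphaning as an ethtool private flag. */
static const char bond_priv_flag_names[][ETH_GSTRING_LEN] = {
        "early-orphan",
};

static int bond_get_sset_count(struct net_device *dev, int sset)
{
        if (sset == ETH_SS_PRIV_FLAGS)
                return ARRAY_SIZE(bond_priv_flag_names);
        return -EOPNOTSUPP;
}

static void bond_get_strings(struct net_device *dev, u32 sset, u8 *data)
{
        if (sset == ETH_SS_PRIV_FLAGS)
                memcpy(data, bond_priv_flag_names,
                       sizeof(bond_priv_flag_names));
}

static u32 bond_get_priv_flags(struct net_device *dev)
{
        return (dev->priv_flags & IFF_EARLY_ORPHAN) ? 1 : 0;
}

static int bond_set_priv_flags(struct net_device *dev, u32 flags)
{
        if (flags & 1)
                dev->priv_flags |= IFF_EARLY_ORPHAN;
        else
                dev->priv_flags &= ~IFF_EARLY_ORPHAN;
        return 0;
}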
* Re: bonding: flow control regression [was Re: bridging: flow control regression]
From: Eric Dumazet @ 2010-12-08 13:50 UTC
To: Simon Horman; +Cc: netdev, Jay Vosburgh, David S. Miller
On Wednesday 8 December 2010 at 22:22 +0900, Simon Horman wrote:
> Hi Eric,
>
> do you have any thoughts on this?
>
> I measured the performance impact of your patch on 2.6.37-rc1
> and I can see why early orphaning is a win.
>
> The tests are run over a bond with 3 slaves.
> The bond is in balance-rr mode. Other parameters of interest are:
> MTU=1500
> client,server:tcp_reordering=3(default)
> client:GSO=off,
> client:TSO=off
> server:GRO=off
> server:rx-usecs=3(default)
>
> Without your no early-orphan patch
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 172.17.60.216 (172.17.60.216) port 0 AF_INET
> Recv Send Send Utilization Service Demand
> Socket Socket Message Elapsed Send Recv Send Recv
> Size Size Size Time Throughput local remote local remote
> bytes bytes bytes secs. 10^6bits/s % S % U us/KB us/KB
>
> 87380 16384 16384 10.00 1621.03 16.31 6.48 1.648 2.621
>
> With your no early-orphan patch
> # netperf -C -c -4 -t TCP_STREAM -H 172.17.60.216
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 172.17.60.216 (172.17.60.216) port 0 AF_INET
> Recv Send Send Utilization Service Demand
> Socket Socket Message Elapsed Send Recv Send Recv
> Size Size Size Time Throughput local remote local remote
> bytes bytes bytes secs. 10^6bits/s % S % U us/KB us/KB
>
> 87380 16384 16384 10.00 1433.48 9.60 5.45 1.098 2.490
>
It seems strange that this makes such a big difference with one flow.
>
> However, in the case of virtualisation I think it is a win to be able to do
> flow control on UDP traffic from guests (using virtio). Am I missing
> something, and can flow control be bypassed anyway? If not, perhaps making
> the change that your patch makes configurable through proc or ethtool is an
> option?
>
virtio_net start_xmit() does one skb_orphan() anyway, so not doing it
a few nanoseconds earlier won't change anything.
The real performance problem is when skbs are queued (for example on an
Ethernet driver's TX ring or a qdisc queue) and then freed some micro-
(or milli-) seconds later.
Maybe your ethtool suggestion is the way to go, so that we can remove the
special skb_orphan() calls done in some drivers: let the core network
stack decide to skb_orphan() itself, not the driver.
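A purely illustrative sketch of that direction (the helper name is made up,
and the IFF_EARLY_ORPHAN bit is the one from the patch earlier in this
thread, not existing kernel code):

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* The per-driver skb_orphan() call goes away; the core xmit path decides,
 * based on a single per-device policy bit and the TX-timestamp state. */
static void core_maybe_orphan(struct sk_buff *skb, struct net_device *dev)
{
        /* A driver that today calls skb_orphan() itself would instead set
         * (or leave clear) the flag and let the stack do it here. */
        if ((dev->priv_flags & IFF_EARLY_ORPHAN) && !skb_shinfo(skb)->tx_flags)
                skb_orphan(skb);
}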