netdev.vger.kernel.org archive mirror
* bridging: flow control regression
@ 2010-11-01 12:29 Simon Horman
  2010-11-01 12:59 ` Eric Dumazet
  0 siblings, 1 reply; 11+ messages in thread
From: Simon Horman @ 2010-11-01 12:29 UTC (permalink / raw)
  To: netdev; +Cc: Jay Vosburgh, Eric Dumazet, David S. Miller

Hi,

I have observed what appears to be a regression between 2.6.34 and
2.6.35-rc1. The behaviour described below is still present in Linus's
current tree (2.6.36+).

On 2.6.34 and earlier when sending a UDP stream to a bonded interface
the throughput is approximately equal to the available physical bandwidth.

# netperf -c -4 -t UDP_STREAM -H 172.17.50.253 -l 30 -- -m 1472
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
172.17.50.253 (172.17.50.253) port 0 AF_INET
Socket  Message  Elapsed      Messages                   CPU      Service
Size    Size     Time         Okay Errors   Throughput   Util     Demand
bytes   bytes    secs            #      #   10^6bits/sec % SU     us/KB

114688    1472   30.00     2438265      0      957.1     18.09    3.159 
109568           30.00     2389980             938.1     -1.00    -1.000

On 2.6.35-rc1 netperf sends ~7 Gbit/s.
Curiously, it only consumes 50% CPU; I would expect this to be CPU bound.

# netperf -c -4 -t UDP_STREAM -H 172.17.50.253 -l 30 -- -m 1472
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
172.17.50.253 (172.17.50.253) port 0 AF_INET
Socket  Message  Elapsed      Messages                   CPU      Service
Size    Size     Time         Okay Errors   Throughput   Util     Demand
bytes   bytes    secs            #      #   10^6bits/sec % SU     us/KB

116736    1472   30.00     18064360      0     7090.8     50.62    8.665 
109568           30.00     2438090             957.0     -1.00    -1.000

In this case the bonding device has a single gigabit slave device
and is running in balance-rr mode. I have observed similar results
with two and three slave devices.

I have bisected the problem and the offending commit appears to be
"net: Introduce skb_orphan_try()". My tired eyes tell me that change
frees skb's earlier than they otherwise would be unless tx timestamping
is in effect. That does seem to make sense in relation to this problem,
though I am yet to dig into specifically why bonding is adversely affected.
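
For reference, the orphaning itself is tiny. A condensed sketch of
skb_orphan() as it appears in kernels of this era (simplified; the real
helper may differ in detail):

static inline void skb_orphan(struct sk_buff *skb)
{
	/* For a UDP packet the destructor is normally sock_wfree(),
	 * which returns skb->truesize to the socket's send-buffer
	 * accounting and wakes up a sender blocked on sk_sndbuf.
	 */
	if (skb->destructor)
		skb->destructor(skb);

	/* After this the packet no longer counts against the socket. */
	skb->sk		= NULL;
	skb->destructor	= NULL;
}

Calling this from dev_hard_start_xmit() (which is what skb_orphan_try()
does, unless TX timestamping still needs the socket reference) means the
send buffer is credited back before the packet has even reached the wire.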



* Re: bridging: flow control regression
  2010-11-01 12:29 bridging: flow control regression Simon Horman
@ 2010-11-01 12:59 ` Eric Dumazet
  2010-11-02  2:06   ` bonding: flow control regression [was Re: bridging: flow control regression] Simon Horman
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Dumazet @ 2010-11-01 12:59 UTC (permalink / raw)
  To: Simon Horman; +Cc: netdev, Jay Vosburgh, David S. Miller

On Monday 01 November 2010 at 21:29 +0900, Simon Horman wrote:
> Hi,
> 
> I have observed what appears to be a regression between 2.6.34 and
> 2.6.35-rc1. The behaviour described below is still present in Linus's
> current tree (2.6.36+).
> 
> On 2.6.34 and earlier when sending a UDP stream to a bonded interface
> the throughput is approximately equal to the available physical bandwidth.
> 
> # netperf -c -4 -t UDP_STREAM -H 172.17.50.253 -l 30 -- -m 1472
> UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 172.17.50.253 (172.17.50.253) port 0 AF_INET
> Socket  Message  Elapsed      Messages                   CPU      Service
> Size    Size     Time         Okay Errors   Throughput   Util     Demand
> bytes   bytes    secs            #      #   10^6bits/sec % SU     us/KB
> 
> 114688    1472   30.00     2438265      0      957.1     18.09    3.159 
> 109568           30.00     2389980             938.1     -1.00    -1.000
> 
> On 2.6.35-rc1 netperf sends ~7 Gbit/s.
> Curiously, it only consumes 50% CPU; I would expect this to be CPU bound.
> 
> # netperf -c -4 -t UDP_STREAM -H 172.17.50.253 -l 30 -- -m 1472
> UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 172.17.50.253 (172.17.50.253) port 0 AF_INET
> Socket  Message  Elapsed      Messages                   CPU      Service
> Size    Size     Time         Okay Errors   Throughput   Util     Demand
> bytes   bytes    secs            #      #   10^6bits/sec % SU     us/KB
> 
> 116736    1472   30.00     18064360      0     7090.8     50.62    8.665 
> 109568           30.00     2438090             957.0     -1.00    -1.000
> 
> In this case the bonding device has a single gigabit slave device
> and is running in balance-rr mode. I have observed similar results
> with two and three slave devices.
> 
> I have bisected the problem and the offending commit appears to be
> "net: Introduce skb_orphan_try()". My tired eyes tell me that change
> frees skb's earlier than they otherwise would be unless tx timestamping
> is in effect. That does seem to make sense in relation to this problem,
> though I am yet to dig into specifically why bonding is adversely affected.
> 

I assume you meant "bonding: flow control regression", i.e. this is not
related to bridging?

One problem with bonding is that its xmit() method always returns
NETDEV_TX_OK.

So a flooder cannot know that some of its frames were lost.

So yes, the patch you mention has the effect of allowing UDP to flood the
bonding device, since we orphan the skb before giving it to the device
(bond or ethX).

With a normal device (with a qdisc), we queue the skb and orphan it only
when it leaves the queue. With a not-too-big socket send buffer, that slows
down the sender enough to "send UDP frames at line rate only".
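
To spell out the back-pressure mechanism, here is a condensed sketch of the
write-space accounting involved; the helper names are real, but the bodies
are simplified and omit locking and reference-counting details:

/* Charging: done when the UDP datagram's skb is created for socket sk.
 * udp_sendmsg() allocates via sock_alloc_send_skb(), which waits while
 * sk_wmem_alloc would exceed sk_sndbuf. */
static inline void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
{
	skb->sk = sk;
	skb->destructor = sock_wfree;
	atomic_add(skb->truesize, &sk->sk_wmem_alloc);
}

/* Uncharging: the destructor, normally run when the skb is finally freed
 * after transmission -- or immediately, if the skb is orphaned early. */
void sock_wfree(struct sk_buff *skb)
{
	struct sock *sk = skb->sk;

	atomic_sub(skb->truesize, &sk->sk_wmem_alloc);
	sk->sk_write_space(sk);		/* wake a sender blocked on sk_sndbuf */
}

While packets wait in a qdisc they stay charged, so a UDP sender with a
modest sk_sndbuf blocks once roughly sk_sndbuf bytes are in flight. On the
bonding path described above, the charge is released before the packet even
reaches the slave's queue, so nothing ever throttles the sender.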





* Re: bonding: flow control regression [was Re: bridging: flow control regression]
  2010-11-01 12:59 ` Eric Dumazet
@ 2010-11-02  2:06   ` Simon Horman
  2010-11-02  4:53     ` Eric Dumazet
  0 siblings, 1 reply; 11+ messages in thread
From: Simon Horman @ 2010-11-02  2:06 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Jay Vosburgh, David S. Miller

On Mon, Nov 01, 2010 at 01:59:32PM +0100, Eric Dumazet wrote:
> On Monday 01 November 2010 at 21:29 +0900, Simon Horman wrote:
> > Hi,
> > 
> > I have observed what appears to be a regression between 2.6.34 and
> > 2.6.35-rc1. The behaviour described below is still present in Linus's
> > current tree (2.6.36+).
> > 
> > On 2.6.34 and earlier when sending a UDP stream to a bonded interface
> > the throughput is approximately equal to the available physical bandwidth.
> > 
> > # netperf -c -4 -t UDP_STREAM -H 172.17.50.253 -l 30 -- -m 1472
> > UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> > 172.17.50.253 (172.17.50.253) port 0 AF_INET
> > Socket  Message  Elapsed      Messages                   CPU      Service
> > Size    Size     Time         Okay Errors   Throughput   Util     Demand
> > bytes   bytes    secs            #      #   10^6bits/sec % SU     us/KB
> > 
> > 114688    1472   30.00     2438265      0      957.1     18.09    3.159 
> > 109568           30.00     2389980             938.1     -1.00    -1.000
> > 
> > On 2.6.35-rc1 netperf sends ~7 Gbit/s.
> > Curiously, it only consumes 50% CPU; I would expect this to be CPU bound.
> > 
> > # netperf -c -4 -t UDP_STREAM -H 172.17.50.253 -l 30 -- -m 1472
> > UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> > 172.17.50.253 (172.17.50.253) port 0 AF_INET
> > Socket  Message  Elapsed      Messages                   CPU      Service
> > Size    Size     Time         Okay Errors   Throughput   Util     Demand
> > bytes   bytes    secs            #      #   10^6bits/sec % SU     us/KB
> > 
> > 116736    1472   30.00     18064360      0     7090.8     50.62    8.665 
> > 109568           30.00     2438090             957.0     -1.00    -1.000
> > 
> > In this case the bonding device has a single gigabit slave device
> > and is running in balance-rr mode. I have observed similar results
> > with two and three slave devices.
> > 
> > I have bisected the problem and the offending commit appears to be
> > "net: Introduce skb_orphan_try()". My tired eyes tell me that change
> > frees skb's earlier than they otherwise would be unless tx timestamping
> > is in effect. That does seem to make sense in relation to this problem,
> > though I am yet to dig into specifically why bonding is adversely affected.
> > 
> 
> I assume you meant "bonding: flow control regression", i.e. this is not
> related to bridging?

Yes, sorry about that. I meant bonding, not bridging.

> One problem with bonding is that its xmit() method always returns
> NETDEV_TX_OK.
> 
> So a flooder cannot know that some of its frames were lost.
> 
> So yes, the patch you mention has the effect of allowing UDP to flood the
> bonding device, since we orphan the skb before giving it to the device
> (bond or ethX).
> 
> With a normal device (with a qdisc), we queue the skb and orphan it only
> when it leaves the queue. With a not-too-big socket send buffer, that slows
> down the sender enough to "send UDP frames at line rate only".

Thanks for the explanation.
I'm not entirely sure how much of a problem this is in practice.


* Re: bonding: flow control regression [was Re: bridging: flow control regression]
  2010-11-02  2:06   ` bonding: flow control regression [was Re: bridging: flow control regression] Simon Horman
@ 2010-11-02  4:53     ` Eric Dumazet
  2010-11-02  7:03       ` Simon Horman
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Dumazet @ 2010-11-02  4:53 UTC (permalink / raw)
  To: Simon Horman; +Cc: netdev, Jay Vosburgh, David S. Miller

On Tuesday 02 November 2010 at 11:06 +0900, Simon Horman wrote:

> Thanks for the explanation.
> I'm not entirely sure how much of a problem this is in practice.

Maybe for virtual devices (tunnels, bonding, ...), it would make sense
to delay the orphaning until we reach the real device.

But if the socket send buffer is very large, it would defeat the flow
control anyway...





* Re: bonding: flow control regression [was Re: bridging: flow control regression]
  2010-11-02  4:53     ` Eric Dumazet
@ 2010-11-02  7:03       ` Simon Horman
  2010-11-02  7:30         ` Eric Dumazet
  0 siblings, 1 reply; 11+ messages in thread
From: Simon Horman @ 2010-11-02  7:03 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Jay Vosburgh, David S. Miller

On Tue, Nov 02, 2010 at 05:53:42AM +0100, Eric Dumazet wrote:
> On Tuesday 02 November 2010 at 11:06 +0900, Simon Horman wrote:
> 
> > Thanks for the explanation.
> > I'm not entirely sure how much of a problem this is in practice.
> 
> Maybe for virtual devices (tunnels, bonding, ...), it would make sense
> to delay the orphaning until we reach the real device.

That was my initial thought. Could you give me some guidance
on how that might be done so I can try to make a patch to test?

> But if the socket send buffer is very large, it would defeat the flow
> control anyway...

I'm primarily concerned about a situation where
UDP packets are sent as fast as possible, indefinitely.
And in that scenario, I think it would need to be a rather large buffer.



* Re: bonding: flow control regression [was Re: bridging: flow control regression]
  2010-11-02  7:03       ` Simon Horman
@ 2010-11-02  7:30         ` Eric Dumazet
  2010-11-02  8:46           ` Simon Horman
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Dumazet @ 2010-11-02  7:30 UTC (permalink / raw)
  To: Simon Horman; +Cc: netdev, Jay Vosburgh, David S. Miller

On Tuesday 02 November 2010 at 16:03 +0900, Simon Horman wrote:
> On Tue, Nov 02, 2010 at 05:53:42AM +0100, Eric Dumazet wrote:
> > On Tuesday 02 November 2010 at 11:06 +0900, Simon Horman wrote:
> > 
> > > Thanks for the explanation.
> > > I'm not entirely sure how much of a problem this is in practice.
> > 
> > Maybe for virtual devices (tunnels, bonding, ...), it would make sense
> > to delay the orphaning until we reach the real device.
> 
> That was my initial thought. Could you give me some guidance
> on how that might be done so I can try to make a patch to test?
> 
> > But if the socket send buffer is very large, it would defeat the flow
> > control anyway...
> 
> I'm primarily concerned about a situation where
> UDP packets are sent as fast as possible, indefinitely.
> And in that scenario, I think it would need to be a rather large buffer.
> 

Please try the following patch, thanks.

 drivers/net/bonding/bond_main.c |    1 +
 include/linux/if.h              |    3 +++
 net/core/dev.c                  |    5 +++--
 3 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index bdb68a6..325931e 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -4714,6 +4714,7 @@ static void bond_setup(struct net_device *bond_dev)
 	bond_dev->flags |= IFF_MASTER|IFF_MULTICAST;
 	bond_dev->priv_flags |= IFF_BONDING;
 	bond_dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
+	bond_dev->priv_flags &= ~IFF_EARLY_ORPHAN;
 
 	if (bond->params.arp_interval)
 		bond_dev->priv_flags |= IFF_MASTER_ARPMON;
diff --git a/include/linux/if.h b/include/linux/if.h
index 1239599..7499a99 100644
--- a/include/linux/if.h
+++ b/include/linux/if.h
@@ -77,6 +77,9 @@
 #define IFF_BRIDGE_PORT	0x8000		/* device used as bridge port */
 #define IFF_OVS_DATAPATH	0x10000	/* device used as Open vSwitch
 					 * datapath port */
+#define IFF_EARLY_ORPHAN	0x20000 /* early orphan skbs in
+					 * dev_hard_start_xmit()
+					 */
 
 #define IF_GET_IFACE	0x0001		/* for querying only */
 #define IF_GET_PROTO	0x0002
diff --git a/net/core/dev.c b/net/core/dev.c
index 35dfb83..eabf94d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2005,7 +2005,8 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 		if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
 			skb_dst_drop(skb);
 
-		skb_orphan_try(skb);
+		if (dev->priv_flags & IFF_EARLY_ORPHAN)
+			skb_orphan_try(skb);
 
 		if (vlan_tx_tag_present(skb) &&
 		    !(dev->features & NETIF_F_HW_VLAN_TX)) {
@@ -5590,7 +5591,7 @@ struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
 	INIT_LIST_HEAD(&dev->napi_list);
 	INIT_LIST_HEAD(&dev->unreg_list);
 	INIT_LIST_HEAD(&dev->link_watch_list);
-	dev->priv_flags = IFF_XMIT_DST_RELEASE;
+	dev->priv_flags = IFF_XMIT_DST_RELEASE | IFF_EARLY_ORPHAN;
 	setup(dev);
 	strcpy(dev->name, name);
 	return dev;




* Re: bonding: flow control regression [was Re: bridging: flow control regression]
  2010-11-02  7:30         ` Eric Dumazet
@ 2010-11-02  8:46           ` Simon Horman
  2010-11-02  9:29             ` Eric Dumazet
  0 siblings, 1 reply; 11+ messages in thread
From: Simon Horman @ 2010-11-02  8:46 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Jay Vosburgh, David S. Miller

On Tue, Nov 02, 2010 at 08:30:57AM +0100, Eric Dumazet wrote:
> On Tuesday 02 November 2010 at 16:03 +0900, Simon Horman wrote:
> > On Tue, Nov 02, 2010 at 05:53:42AM +0100, Eric Dumazet wrote:
> > > On Tuesday 02 November 2010 at 11:06 +0900, Simon Horman wrote:
> > > 
> > > > Thanks for the explanation.
> > > > I'm not entirely sure how much of a problem this is in practice.
> > > 
> > > Maybe for virtual devices (tunnels, bonding, ...), it would make sense
> > > to delay the orphaning until we reach the real device.
> > 
> > That was my initial thought. Could you give me some guidance
> > on how that might be done so I can try to make a patch to test?
> > 
> > > But if the socket send buffer is very large, it would defeat the flow
> > > control anyway...
> > 
> > I'm primarily concerned about a situation where
> > UDP packets are sent as fast as possible, indefinitely.
> > And in that scenario, I think it would need to be a rather large buffer.
> > 
> 
> Please try following patch, thanks.

Thanks Eric, that seems to resolve the problem that I was seeing.

With your patch I see:

No bonding

# netperf -c -4 -t UDP_STREAM -H 172.17.60.216 -l 30 -- -m 1472
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216 (172.17.60.216) port 0 AF_INET
Socket  Message  Elapsed      Messages                   CPU      Service
Size    Size     Time         Okay Errors   Throughput   Util     Demand
bytes   bytes    secs            #      #   10^6bits/sec % SU     us/KB

116736    1472   30.00     2438413      0      957.2     8.52     1.458 
129024           30.00     2438413             957.2     -1.00    -1.000

With bonding (one slave, the interface used in the test above)

netperf -c -4 -t UDP_STREAM -H 172.17.60.216 -l 30 -- -m 1472
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216 (172.17.60.216) port 0 AF_INET
Socket  Message  Elapsed      Messages                   CPU      Service
Size    Size     Time         Okay Errors   Throughput   Util     Demand
bytes   bytes    secs            #      #   10^6bits/sec % SU     us/KB

116736    1472   30.00     2438390      0      957.1     8.97     1.535 
129024           30.00     2438390             957.1     -1.00    -1.000



* Re: bonding: flow control regression [was Re: bridging: flow control regression]
  2010-11-02  8:46           ` Simon Horman
@ 2010-11-02  9:29             ` Eric Dumazet
  2010-11-06  9:25               ` Simon Horman
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Dumazet @ 2010-11-02  9:29 UTC (permalink / raw)
  To: Simon Horman; +Cc: netdev, Jay Vosburgh, David S. Miller

On Tuesday 02 November 2010 at 17:46 +0900, Simon Horman wrote:

> Thanks Eric, that seems to resolve the problem that I was seeing.
> 
> With your patch I see:
> 
> No bonding
> 
> # netperf -c -4 -t UDP_STREAM -H 172.17.60.216 -l 30 -- -m 1472
> UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216 (172.17.60.216) port 0 AF_INET
> Socket  Message  Elapsed      Messages                   CPU      Service
> Size    Size     Time         Okay Errors   Throughput   Util     Demand
> bytes   bytes    secs            #      #   10^6bits/sec % SU     us/KB
> 
> 116736    1472   30.00     2438413      0      957.2     8.52     1.458 
> 129024           30.00     2438413             957.2     -1.00    -1.000
> 
> With bonding (one slave, the interface used in the test above)
> 
> netperf -c -4 -t UDP_STREAM -H 172.17.60.216 -l 30 -- -m 1472
> UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216 (172.17.60.216) port 0 AF_INET
> Socket  Message  Elapsed      Messages                   CPU      Service
> Size    Size     Time         Okay Errors   Throughput   Util     Demand
> bytes   bytes    secs            #      #   10^6bits/sec % SU     us/KB
> 
> 116736    1472   30.00     2438390      0      957.1     8.97     1.535 
> 129024           30.00     2438390             957.1     -1.00    -1.000
> 


Sure, the patch helps when not too many flows are involved, but this is a
hack.

Say the device queue is 1000 packets and you run a workload with 2000
sockets: it won't work...

Or the device queue is 1000 packets, there is one flow, and the socket send
buffer allows more than 1000 packets to be 'in flight' (echo 2000000
>/proc/sys/net/core/wmem_default): it won't work with bonding either, only
with setups where a qdisc sits on the first device met after the socket.
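
To put rough numbers on the second case (the per-packet truesize is only an
estimate; the exact value depends on the driver and allocation sizes):

	wmem_default                       = 2,000,000 bytes
	truesize of one 1472-byte datagram ~  2,048 bytes
	packets chargeable to the socket   ~  2,000,000 / 2,048 ~ 1000

So a single flow may keep on the order of 1000 packets charged to its send
buffer, enough to fill the 1000-packet device queue: the queue overflows and
drops before the socket send buffer ever pushes back.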





* Re: bonding: flow control regression [was Re: bridging: flow control regression]
  2010-11-02  9:29             ` Eric Dumazet
@ 2010-11-06  9:25               ` Simon Horman
  2010-12-08 13:22                 ` Simon Horman
  0 siblings, 1 reply; 11+ messages in thread
From: Simon Horman @ 2010-11-06  9:25 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Jay Vosburgh, David S. Miller

On Tue, Nov 02, 2010 at 10:29:45AM +0100, Eric Dumazet wrote:
> On Tuesday 02 November 2010 at 17:46 +0900, Simon Horman wrote:
> 
> > Thanks Eric, that seems to resolve the problem that I was seeing.
> > 
> > With your patch I see:
> > 
> > No bonding
> > 
> > # netperf -c -4 -t UDP_STREAM -H 172.17.60.216 -l 30 -- -m 1472
> > UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216 (172.17.60.216) port 0 AF_INET
> > Socket  Message  Elapsed      Messages                   CPU      Service
> > Size    Size     Time         Okay Errors   Throughput   Util     Demand
> > bytes   bytes    secs            #      #   10^6bits/sec % SU     us/KB
> > 
> > 116736    1472   30.00     2438413      0      957.2     8.52     1.458 
> > 129024           30.00     2438413             957.2     -1.00    -1.000
> > 
> > With bonding (one slave, the interface used in the test above)
> > 
> > netperf -c -4 -t UDP_STREAM -H 172.17.60.216 -l 30 -- -m 1472
> > UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216 (172.17.60.216) port 0 AF_INET
> > Socket  Message  Elapsed      Messages                   CPU      Service
> > Size    Size     Time         Okay Errors   Throughput   Util     Demand
> > bytes   bytes    secs            #      #   10^6bits/sec % SU     us/KB
> > 
> > 116736    1472   30.00     2438390      0      957.1     8.97     1.535 
> > 129024           30.00     2438390             957.1     -1.00    -1.000
> > 
> 
> 
> Sure, the patch helps when not too many flows are involved, but this is a
> hack.
> 
> Say the device queue is 1000 packets and you run a workload with 2000
> sockets: it won't work...
> 
> Or the device queue is 1000 packets, there is one flow, and the socket send
> buffer allows more than 1000 packets to be 'in flight' (echo 2000000
> >/proc/sys/net/core/wmem_default): it won't work with bonding either, only
> with setups where a qdisc sits on the first device met after the socket.

True, thanks for pointing that out.

The scenario that I am actually interested in is virtualisation.
I believe that your patch helps the vhostnet case (I don't see
flow-control problems with bonding + virtio without vhostnet). However,
I am unsure whether there are also easy ways to defeat flow control
in the vhostnet case too.



* Re: bonding: flow control regression [was Re: bridging: flow control regression]
  2010-11-06  9:25               ` Simon Horman
@ 2010-12-08 13:22                 ` Simon Horman
  2010-12-08 13:50                   ` Eric Dumazet
  0 siblings, 1 reply; 11+ messages in thread
From: Simon Horman @ 2010-12-08 13:22 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Jay Vosburgh, David S. Miller

On Sat, Nov 06, 2010 at 06:25:37PM +0900, Simon Horman wrote:
> On Tue, Nov 02, 2010 at 10:29:45AM +0100, Eric Dumazet wrote:
> > On Tuesday 02 November 2010 at 17:46 +0900, Simon Horman wrote:
> > 
> > > Thanks Eric, that seems to resolve the problem that I was seeing.
> > > 
> > > With your patch I see:
> > > 
> > > No bonding
> > > 
> > > # netperf -c -4 -t UDP_STREAM -H 172.17.60.216 -l 30 -- -m 1472
> > > UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216 (172.17.60.216) port 0 AF_INET
> > > Socket  Message  Elapsed      Messages                   CPU      Service
> > > Size    Size     Time         Okay Errors   Throughput   Util     Demand
> > > bytes   bytes    secs            #      #   10^6bits/sec % SU     us/KB
> > > 
> > > 116736    1472   30.00     2438413      0      957.2     8.52     1.458 
> > > 129024           30.00     2438413             957.2     -1.00    -1.000
> > > 
> > > With bonding (one slave, the interface used in the test above)
> > > 
> > > netperf -c -4 -t UDP_STREAM -H 172.17.60.216 -l 30 -- -m 1472
> > > UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216 (172.17.60.216) port 0 AF_INET
> > > Socket  Message  Elapsed      Messages                   CPU      Service
> > > Size    Size     Time         Okay Errors   Throughput   Util     Demand
> > > bytes   bytes    secs            #      #   10^6bits/sec % SU     us/KB
> > > 
> > > 116736    1472   30.00     2438390      0      957.1     8.97     1.535 
> > > 129024           30.00     2438390             957.1     -1.00    -1.000
> > > 
> > 
> > 
> > Sure, the patch helps when not too many flows are involved, but this is a
> > hack.
> > 
> > Say the device queue is 1000 packets and you run a workload with 2000
> > sockets: it won't work...
> > 
> > Or the device queue is 1000 packets, there is one flow, and the socket send
> > buffer allows more than 1000 packets to be 'in flight' (echo 2000000
> > >/proc/sys/net/core/wmem_default): it won't work with bonding either, only
> > with setups where a qdisc sits on the first device met after the socket.
> 
> True, thanks for pointing that out.
> 
> The scenario that I am actually interested in is virtualisation.
> I believe that your patch helps the vhostnet case (I don't see
> flow-control problems with bonding + virtio without vhostnet). However,
> I am unsure whether there are also easy ways to defeat flow control
> in the vhostnet case too.

Hi Eric,

do you have any thoughts on this?

I measured the performance impact of your patch on 2.6.37-rc1
and I can see why early orphaning is a win.

The tests are run over a bond with 3 slaves.
The bond is in balance-rr mode. Other parameters of interest are:
	MTU=1500
	client,server:tcp_reordering=3(default)
	client:GSO=off,
	client:TSO=off
	server:GRO=off
	server:rx-usecs=3(default)

Without your no-early-orphan patch:
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
	172.17.60.216 (172.17.60.216) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      1621.03   16.31    6.48     1.648   2.621

With your no-early-orphan patch:
# netperf -C -c -4 -t TCP_STREAM -H 172.17.60.216
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
	172.17.60.216 (172.17.60.216) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      1433.48   9.60     5.45     1.098   2.490


However, in the case of virtualisation I think it is a win to be able to do
flow control on UDP traffic from guests (using virtio). Am I missing
something, or can flow control be bypassed anyway? If not, perhaps making
the change that your patch makes configurable through proc or ethtool is an
option?



* Re: bonding: flow control regression [was Re: bridging: flow control regression]
  2010-12-08 13:22                 ` Simon Horman
@ 2010-12-08 13:50                   ` Eric Dumazet
  0 siblings, 0 replies; 11+ messages in thread
From: Eric Dumazet @ 2010-12-08 13:50 UTC (permalink / raw)
  To: Simon Horman; +Cc: netdev, Jay Vosburgh, David S. Miller

On Wednesday 08 December 2010 at 22:22 +0900, Simon Horman wrote:

> Hi Eric,
> 
> do you have any thoughts on this?
> 
> I measured the performance impact of your patch on 2.6.37-rc1
> and I can see why early orphaning is a win.
> 
> The tests are run over a bond with 3 slaves.
> The bond is in balance-rr mode. Other parameters of interest are:
> 	MTU=1500
> 	client,server:tcp_reordering=3(default)
> 	client:GSO=off,
> 	client:TSO=off
> 	server:GRO=off
> 	server:rx-usecs=3(default)
> 
> Without your no-early-orphan patch:
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 	172.17.60.216 (172.17.60.216) port 0 AF_INET
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB
> 
>  87380  16384  16384    10.00      1621.03   16.31    6.48     1.648   2.621
> 
> With your no-early-orphan patch:
> # netperf -C -c -4 -t TCP_STREAM -H 172.17.60.216
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 	172.17.60.216 (172.17.60.216) port 0 AF_INET
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB
> 
>  87380  16384  16384    10.00      1433.48   9.60     5.45     1.098   2.490
> 

It seems strange that this makes such a big difference with one flow.


> 
> However, in the case of virtualisation I think it is a win to be able to do
> flow control on UDP traffic from guests (using virtio). Am I missing
> something, or can flow control be bypassed anyway? If not, perhaps making
> the change that your patch makes configurable through proc or ethtool is an
> option?
> 

virtio_net's start_xmit() does one skb_orphan() anyway, so not doing it
a few nanoseconds earlier won't change anything.

The real perf problem is when skbs are queued (for example on an ethernet
driver's TX ring or in a qdisc queue) and then freed some microseconds (or
milliseconds) later.

Maybe your ethtool suggestion is the way to go, so that we can remove the
special skb_orphan() calls done in some drivers: let the core network stack
decide to skb_orphan() itself, not the driver.
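
To illustrate the pattern, a driver-level early orphan looks roughly like
this (condensed, hypothetical ndo_start_xmit; queue_to_hw() is a stand-in
for the driver's real enqueue routine, not an actual kernel function):

static netdev_tx_t my_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	/* Hand the packet to the hardware / host transmit queue. */
	queue_to_hw(dev, skb);

	/* Driver-level early orphan: the skb may sit on the TX ring for a
	 * while before the completion path frees it, so release the
	 * socket's send-buffer charge now instead of then. */
	skb_orphan(skb);

	return NETDEV_TX_OK;
}

Moving the decision into the core stack would mean dev_hard_start_xmit()
orphans (or not) based on a per-device flag, much like the IFF_EARLY_ORPHAN
bit in the earlier patch, and drivers drop their private skb_orphan() calls.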






Thread overview: 11 messages
2010-11-01 12:29 bridging: flow control regression Simon Horman
2010-11-01 12:59 ` Eric Dumazet
2010-11-02  2:06   ` bonding: flow control regression [was Re: bridging: flow control regression] Simon Horman
2010-11-02  4:53     ` Eric Dumazet
2010-11-02  7:03       ` Simon Horman
2010-11-02  7:30         ` Eric Dumazet
2010-11-02  8:46           ` Simon Horman
2010-11-02  9:29             ` Eric Dumazet
2010-11-06  9:25               ` Simon Horman
2010-12-08 13:22                 ` Simon Horman
2010-12-08 13:50                   ` Eric Dumazet
