netdev.vger.kernel.org archive mirror
* error(s) in 2.6.23-rc5 bonding.txt ?
@ 2007-09-07 22:02 Rick Jones
  2007-09-07 23:31 ` Jay Vosburgh
  0 siblings, 1 reply; 6+ messages in thread
From: Rick Jones @ 2007-09-07 22:02 UTC (permalink / raw)
  To: Linux Network Development list

I was perusing Documentation/networking/bonding.txt in a 2.6.23-rc5 tree 
and came across the following discussing the round-robin scheduling:

>         Note that this out of order delivery occurs when both the
>         sending and receiving systems are utilizing a multiple
>         interface bond.  Consider a configuration in which a
>         balance-rr bond feeds into a single higher capacity network
>         channel (e.g., multiple 100Mb/sec ethernets feeding a single
>         gigabit ethernet via an etherchannel capable switch).  In this
>         configuration, traffic sent from the multiple 100Mb devices to
>         a destination connected to the gigabit device will not see
>         packets out of order.  

My first reaction was that this was incorrect - it didn't matter whether the 
receiver was using a single link or not, because the packets flowing across 
the multiple 100Mb links could hit the intermediate device out of order and 
then stay that way across the GbE link.

Before I go and patch out that text, I thought I'd double-check.

rick jones

* Re: error(s) in 2.6.23-rc5 bonding.txt ?
  2007-09-07 22:02 error(s) in 2.6.23-rc5 bonding.txt ? Rick Jones
@ 2007-09-07 23:31 ` Jay Vosburgh
  2007-09-07 23:46   ` Rick Jones
  2007-09-08  6:05   ` Bill Fink
  0 siblings, 2 replies; 6+ messages in thread
From: Jay Vosburgh @ 2007-09-07 23:31 UTC (permalink / raw)
  To: Rick Jones; +Cc: Linux Network Development list

Rick Jones <rick.jones2@hp.com> wrote:
[...]
>>         Note that this out of order delivery occurs when both the
>>         sending and receiving systems are utilizing a multiple
>>         interface bond.  Consider a configuration in which a
>>         balance-rr bond feeds into a single higher capacity network
>>         channel (e.g., multiple 100Mb/sec ethernets feeding a single
>>         gigabit ethernet via an etherchannel capable switch).  In this
>>         configuration, traffic sent from the multiple 100Mb devices to
>>         a destination connected to the gigabit device will not see
>>         packets out of order.  
>
>My first reaction was that this was incorrect - it didn't matter if the
>receiver was using a single link or not because the packets flowing across
>the multiple 100Mb links could hit the intermediate device out of order
>and so stay that way across the GbE link.

	Usually it does matter, or at least it did at the time I tested this.

	Usually, the even striping of traffic from the balance-rr mode
will deliver in-order to a single higher speed link (e.g., N 100Mb
feeding a single 1Gb).  I say "usually" because, although I don't see it
happen with the equipment I have, I'm willing to believe that there are
gizmos that would "bundle" packets arriving on the switch ports.

	The reordering (usually) occurs when packet coalescing stuff
(either interrupt mitigation on the device, or NAPI) happens at the
receiver end, after the packets are striped evenly into the interfaces,
e.g.,

	eth0	eth1	eth2
	P1	P2	P3
	P4	P5	P6
	P7	P8	P9

	and then eth0 goes and grabs a bunch of its packets, then eth1,
and eth2 do the same afterwards, so the received order ends up something
like P1, P4, P7, P2, P5, P8, P3, P6, P9.  In Ye Olde Dayes Of Yore, with
one packet per interrupt at 10 Mb/sec, this type of configuration
wouldn't reorder (or at least not as badly).
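
	A minimal Python sketch of that batching effect (purely
illustrative; the three slaves and the batch size of three are just the
numbers from the diagram above):

def stripe_round_robin(packets, n_slaves):
    # The bonding side: assign packets to slave tx queues in strict
    # round-robin order, as balance-rr does.
    queues = [[] for _ in range(n_slaves)]
    for i, pkt in enumerate(packets):
        queues[i % n_slaves].append(pkt)
    return queues

def drain_in_batches(queues, batch_size):
    # The receiving side: poll each interface in turn and take up to
    # batch_size packets per poll -- a crude stand-in for interrupt
    # mitigation / NAPI-style coalescing.
    delivered = []
    while any(queues):
        for q in queues:
            delivered.extend(q[:batch_size])
            del q[:batch_size]
    return delivered

packets = ["P%d" % i for i in range(1, 10)]          # P1 .. P9
queues = stripe_round_robin(packets, n_slaves=3)     # eth0, eth1, eth2
print(drain_in_batches(queues, batch_size=3))
# ['P1', 'P4', 'P7', 'P2', 'P5', 'P8', 'P3', 'P6', 'P9']
print(drain_in_batches(stripe_round_robin(packets, 3), batch_size=1))
# one packet per poll (the old 10Mb case): in-order delivery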

	The text probably is lacking in some detail, though.  The real
key is that the last sender before the destination system has to do the
round-robin striping.  Most switches that I'm familiar with don't have
round-robin as a load balance option for etherchannel (again, I've never
seen one that does, but I'm willing to believe one exists), and thus
won't evenly stripe traffic; instead they do some math on the packets so
that a given "connection" isn't split across ports.

	That said, it's certainly plausible that, for a given set of N
ethernets all enslaved to a single bonding balance-rr, the individual
ethernets could get out of sync, as it were (e.g., one running a fuller
tx ring, and thus running "behind" the others).  If bonding is the only
feeder of the devices, then for a continuous flow of traffic, all the
slaves will generally receive packets (from the kernel, for
transmission) at pretty much the same rate, and so they won't tend to
get ahead or behind.

	I haven't investigated into this deeply for a few years, but
this is my recollection of what happened with the tests I did then.  I
did testing with multiple 100Mb devices feeding either other sets of
100Mb devices or single gigabit devices.  I'm willing to believe that
things have changed, and an N feeding into one configuration can
reorder, but I haven't seen it (or really looked for it; balance-rr
isn't much the rage these days).

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

* Re: error(s) in 2.6.23-rc5 bonding.txt ?
  2007-09-07 23:31 ` Jay Vosburgh
@ 2007-09-07 23:46   ` Rick Jones
  2007-09-08  1:01     ` Jay Vosburgh
  2007-09-08  6:05   ` Bill Fink
  1 sibling, 1 reply; 6+ messages in thread
From: Rick Jones @ 2007-09-07 23:46 UTC (permalink / raw)
  To: Jay Vosburgh; +Cc: Linux Network Development list

> 	That said, it's certainly plausible that, for a given set of N
> ethernets all enslaved to a single bonding balance-rr, the individual
> ethernets could get out of sync, as it were (e.g., one running a fuller
> tx ring, and thus running "behind" the others). 

That is the scenario of which I was thinking.

> If bonding is the only feeder of the devices, then for a continuous
> flow of traffic, all the slaves will generally receive packets (from
> the kernel, for transmission) at pretty much the same rate, and so
> they won't tend to get ahead or behind.

I could see that if there were just one TCP connection doing bulk transfer 
or something, but if there were a bulk transmitter coupled with an 
occasional request/response (i.e., a netperf TCP_STREAM alongside a TCP_RR) 
I'd think the tx rings would no longer remain balanced.

> 	I haven't investigated into this deeply for a few years, but
> this is my recollection of what happened with the tests I did then.  I
> did testing with multiple 100Mb devices feeding either other sets of
> 100Mb devices or single gigabit devices.  I'm willing to believe that
> things have changed, and an N feeding into one configuration can
> reorder, but I haven't seen it (or really looked for it; balance-rr
> isn't much the rage these days).

Are you OK with that block of text simply being yanked?

rick

* Re: error(s) in 2.6.23-rc5 bonding.txt ?
  2007-09-07 23:46   ` Rick Jones
@ 2007-09-08  1:01     ` Jay Vosburgh
  2007-09-28 21:31       ` Rick Jones
  0 siblings, 1 reply; 6+ messages in thread
From: Jay Vosburgh @ 2007-09-08  1:01 UTC (permalink / raw)
  To: Rick Jones; +Cc: Linux Network Development list

Rick Jones <rick.jones2@hp.com> wrote:
[...]
>> If bonding is the only feeder of the devices, then for a continuous
>> flow of traffic, all the slaves will generally receive packets (from
>> the kernel, for transmission) at pretty much the same rate, and so
>> they won't tend to get ahead or behind.
>
>I could see that if there was just one TCP connection going doing bulk or
>something, but if there were a bulk transmitter coupled with an occasional
>request/response (ie netperf TCP_STREAM and a TCP_RR) i'd think the tx
>rings would no longer remain balanced.

	I'm not sure that would be the case, because even the traffic
"bump" from the TCP_RR would be funneled through the round-robin.  So,
the next packet of the bulk transmit would simply be "pushed back" to
the next available interface.

	Perhaps varying packet sizes would throw things out of whack, if
the small ones happened to line up all on one interface (regardless of
the other traffic).

	A PAUSE frame to one interface would almost certainly get things
out of whack, but I don't know how long it would stay out of whack (or,
really, how likely getting a PAUSE is).  Probably just as long as all of
the slaves are running at full speed.

>> 	I haven't investigated into this deeply for a few years, but
>> this is my recollection of what happened with the tests I did then.  I
>> did testing with multiple 100Mb devices feeding either other sets of
>> 100Mb devices or single gigabit devices.  I'm willing to believe that
>> things have changed, and an N feeding into one configuration can
>> reorder, but I haven't seen it (or really looked for it; balance-rr
>> isn't much the rage these days).
>
>Are you OK with that block of text simply being yanked?

	Mmm... I'm an easy sell for a "usually" or other suitable caveat
added in strategic places (avoiding absolute statements and all that).
The text does reflect the results of experiments I ran at the time, so
I'm reluctant to toss it wholesale simply because we speculate over how
it might not be accurate.

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

* Re: error(s) in 2.6.23-rc5 bonding.txt ?
  2007-09-07 23:31 ` Jay Vosburgh
  2007-09-07 23:46   ` Rick Jones
@ 2007-09-08  6:05   ` Bill Fink
  1 sibling, 0 replies; 6+ messages in thread
From: Bill Fink @ 2007-09-08  6:05 UTC (permalink / raw)
  To: Jay Vosburgh; +Cc: Rick Jones, Linux Network Development list

On Fri, 07 Sep 2007, Jay Vosburgh wrote:

> Rick Jones <rick.jones2@hp.com> wrote:
> [...]
> >>         Note that this out of order delivery occurs when both the
> >>         sending and receiving systems are utilizing a multiple
> >>         interface bond.  Consider a configuration in which a
> >>         balance-rr bond feeds into a single higher capacity network
> >>         channel (e.g., multiple 100Mb/sec ethernets feeding a single
> >>         gigabit ethernet via an etherchannel capable switch).  In this
> >>         configuration, traffic sent from the multiple 100Mb devices to
> >>         a destination connected to the gigabit device will not see
> >>         packets out of order.  

I would just change the last part of the last sentence to:

	configuration, traffic sent from the multiple 100Mb devices to
	a destination connected to the gigabit device will not usually
	see packets out of order in the absence of congestion on the
	outgoing gigabit ethernet interface.

If there were momentary congestion on the outgoing gigabit ethernet
interface, I suppose it would be possible to get out-of-order delivery if
some of the incoming packets on the striped round-robin interfaces had to
be buffered for a short while before they could be forwarded.

> 	The text probably is lacking in some detail, though.  The real
> key is that the last sender before getting to the destination system has
> to do the round-robin striping.  Most switches that I'm familiar with
> (again, never seen one, but willing to believe there is one) don't have
> round-robin as a load balance option for etherchannel, and thus won't
> evenly stripe traffic, but instead do some math on the packets so that a
> given "connection" isn't split across ports.

Just FYI, the "i" series of Extreme switches (and some of their other
switches) support round-robin load balancing.  We consider this an
extremely useful feature (but only use it for switch-to-switch link
aggregation), and bemoan its lack in their newer switch offerings.
Force10 switches also have a pseudo round-robin load balancing capability,
which they call packet-based; it works by distributing packets based on
the IP Identification field (and only works for IPv4).  One downside of
the Force10 feature is that it is a global setting and thus affects all
aggregated links (the Extreme feature can be set on a per-aggregated-link
basis).
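
As a rough illustration of why an IP-ID-based policy acts like round-robin
for a single IPv4 flow (the simple modulo below is only a guess at the
mechanism for illustration, not Force10's actual algorithm):

def port_for_packet(ip_id, n_links):
    # IPv4 senders traditionally increment the ID per packet, so taking
    # it modulo the number of links cycles a single flow across them.
    return ip_id % n_links

ids = range(1000, 1008)                     # consecutive IDs from one flow
print([port_for_packet(i, 4) for i in ids])
# [0, 1, 2, 3, 0, 1, 2, 3]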

						-Bill

* Re: error(s) in 2.6.23-rc5 bonding.txt ?
  2007-09-08  1:01     ` Jay Vosburgh
@ 2007-09-28 21:31       ` Rick Jones
  0 siblings, 0 replies; 6+ messages in thread
From: Rick Jones @ 2007-09-28 21:31 UTC (permalink / raw)
  To: Jay Vosburgh; +Cc: Linux Network Development list

Well, I managed to concoct an updated test, this time with 1G links going into 
a 10G link.  The sending system ran a 2.6.23-rc8 kernel with four dual-port 
82546GBs, connected to an HP ProCurve 3500 series switch with a 10G link to a 
system running 2.6.18-8.el5 (I was having difficulty getting cxgb3 going on my 
kernel.org kernels - firmware mismatches - so I booted RHEL5 there).

I put all four 1G interfaces into a balance_rr (mode=0) bond and started running 
just a single netperf TCP_STREAM test.

On the bonding side:

hpcpc103:~/net-2.6.24/Documentation/networking# netstat -s -t | grep retran
     19050 segments retransmited
     9349 fast retransmits
     9698 forward retransmits
hpcpc103:~/net-2.6.24/Documentation/networking# ifconfig bond0 | grep pack
           RX packets:50708119 errors:0 dropped:0 overruns:0 frame:0
           TX packets:58801285 errors:0 dropped:0 overruns:0 carrier:0
hpcpc103:~/net-2.6.24/Documentation/networking# netperf -H 192.168.5.106
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.5.106 
(192.168.5.106) port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

  87380  16384  16384    10.01    1267.99
hpcpc103:~/net-2.6.24/Documentation/networking# netstat -s -t | grep retran
     20268 segments retransmited
     9974 fast retransmits
     10291 forward retransmits
hpcpc103:~/net-2.6.24/Documentation/networking# ifconfig bond0 | grep pack
           RX packets:51636421 errors:0 dropped:0 overruns:0 frame:0
           TX packets:59899089 errors:0 dropped:0 overruns:0 carrier:0

On the receiving side:

[root@hpcpc106 ~]# ifconfig eth5 | grep pack
           RX packets:58802455 errors:0 dropped:0 overruns:0 frame:0
           TX packets:50205304 errors:0 dropped:0 overruns:0 carrier:0
[root@hpcpc106 ~]# ifconfig eth5 | grep pack
           RX packets:59900267 errors:0 dropped:0 overruns:0 frame:0
           TX packets:51124138 errors:0 dropped:0 overruns:0 carrier:0

So, there were 20268 - 19050, or 1218, retransmissions during the test.  The 
sending side reported sending 59899089 - 58801285 = 1097804 packets, and the 
receiver reported receiving 59900267 - 58802455 = 1097812 packets.

Unless the switch was only occasionally duplicating segments or something, it 
looks like all the retransmissions were the result of duplicate ACKs from packet 
reordering.

For grins I varied the "reordering" sysctl and got:

# netstat -s -t | grep retran; for i in 3 4 5 6 7 8 9 10 20 30; do sysctl -w 
net.ipv4.tcp_reordering=$i; netperf -H 192.168.5.106 -P 0 -B "reorder $i"; 
netstat -s -t | grep retran; done
     13735 segments retransmited
     6581 fast retransmits
     7151 forward retransmits
net.ipv4.tcp_reordering = 3
  87380  16384  16384    10.01    1294.51   reorder 3
     15127 segments retransmited
     7330 fast retransmits
     7794 forward retransmits
net.ipv4.tcp_reordering = 4
  87380  16384  16384    10.01    1304.22   reorder 4
     16103 segments retransmited
     7807 fast retransmits
     8293 forward retransmits
net.ipv4.tcp_reordering = 5
  87380  16384  16384    10.01    1330.88   reorder 5
     16763 segments retransmited
     8155 fast retransmits
     8605 forward retransmits
net.ipv4.tcp_reordering = 6
  87380  16384  16384    10.01    1350.50   reorder 6
     17134 segments retransmited
     8356 fast retransmits
     8775 forward retransmits
net.ipv4.tcp_reordering = 7
  87380  16384  16384    10.01    1353.00   reorder 7
     17492 segments retransmited
     8553 fast retransmits
     8936 forward retransmits
net.ipv4.tcp_reordering = 8
  87380  16384  16384    10.01    1358.00   reorder 8
     17649 segments retransmited
     8625 fast retransmits
     9021 forward retransmits
net.ipv4.tcp_reordering = 9
  87380  16384  16384    10.01    1415.89   reorder 9
     17736 segments retransmited
     8666 fast retransmits
     9067 forward retransmits
net.ipv4.tcp_reordering = 10
  87380  16384  16384    10.01    1412.36   reorder 10
     17773 segments retransmited
     8684 fast retransmits
     9086 forward retransmits
net.ipv4.tcp_reordering = 20
  87380  16384  16384    10.01    1403.47   reorder 20
     17773 segments retransmited
     8684 fast retransmits
     9086 forward retransmits
net.ipv4.tcp_reordering = 30
  87380  16384  16384    10.01    1325.41   reorder 30
     17773 segments retransmited
     8684 fast retransmits
     9086 forward retransmits

I.e., fast retransmits from reordering continued until the reorder limit was 
raised reasonably well above the number of links in the aggregate.
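
For what it's worth, a toy model of why raising the threshold helps (the
four slaves and the four-packet polling batch below are assumptions, not
measurements from this setup):

def max_dupacks(arrival_order):
    # Largest duplicate-ACK burst the receiver would emit while waiting
    # for any one missing segment; fast retransmit fires (spuriously,
    # here) once a burst reaches the tcp_reordering threshold.
    expected, buffered, dupacks, worst = 0, set(), 0, 0
    for seg in arrival_order:
        if seg == expected:
            dupacks = 0
            expected += 1
            while expected in buffered:    # hole filled, cumulative ACK jumps
                buffered.discard(expected)
                expected += 1
        else:
            buffered.add(seg)              # out of order -> duplicate ACK
            dupacks += 1
            worst = max(worst, dupacks)
    return worst

n_slaves, batch = 4, 4
segs = list(range(n_slaves * batch))
# striped round-robin, then drained one full batch per interface:
arrival = [s for q in range(n_slaves) for s in segs[q::n_slaves]]
for threshold in (3, 10):
    print("tcp_reordering=%d: spurious fast retransmit: %s"
          % (threshold, max_dupacks(arrival) >= threshold))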

As for how things got reordered, Knuth knows exactly why.  But it didn't need 
more than one connection, and that connection didn't have to vary the size of 
what it was passing to send().  Netperf was not making send calls that were an 
integral multiple of the MSS, which means that from time to time a short segment 
would be queued to an interface in the bond.  Also, two of the dual-port NICs 
were on 66 MHz PCI-X busses and the other two were on 133 MHz PCI-X busses 
(four busses in all), so the DMA times would have differed.
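
To put a number on the short-segment point (assuming a 1448-byte effective
MSS after the timestamp option; the trace only shows the 1460 bytes
requested), a 16384-byte send doesn't divide evenly:

send_size, mss = 16384, 1448     # assumed effective MSS with timestamps
full_segments, tail = divmod(send_size, mss)
print(full_segments, tail)       # 11 full segments plus a 456-byte tail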


And as if this mail wasn't already long enough, here is a tcptrace summary for 
the netperf data connection with tcp_reordering at 3:

================================
TCP connection 2:
         host c:        192.168.5.103:52264
         host d:        192.168.5.106:33940
         complete conn: yes
         first packet:  Fri Sep 28 14:06:43.271692 2007
         last packet:   Fri Sep 28 14:06:53.277018 2007
         elapsed time:  0:00:10.005326
         total packets: 1556191
         filename:      trace
    c->d:                              d->c:
      total packets:        699400           total packets:        856791
      ack pkts sent:        699399           ack pkts sent:        856791
      pure acks sent:            2           pure acks sent:       856789
      sack pkts sent:            0           sack pkts sent:       352480
      dsack pkts sent:           0           dsack pkts sent:         948
      max sack blks/ack:         0           max sack blks/ack:         3
      unique bytes sent: 1180423912           unique bytes sent:         0
      actual data pkts:     699397           actual data pkts:          0
      actual data bytes: 1180581744           actual data bytes:         0
      rexmt data pkts:         106           rexmt data pkts:           0
      rexmt data bytes:     157832           rexmt data bytes:          0
      zwnd probe pkts:           0           zwnd probe pkts:           0
      zwnd probe bytes:          0           zwnd probe bytes:          0
      outoforder pkts:      202461           outoforder pkts:           0
      pushed data pkts:       6057           pushed data pkts:          0
      SYN/FIN pkts sent:       1/1           SYN/FIN pkts sent:       1/1
      req 1323 ws/ts:          Y/Y           req 1323 ws/ts:          Y/Y
      adv wind scale:            7           adv wind scale:            9
      req sack:                  Y           req sack:                  Y
      sacks sent:                0           sacks sent:           352480
      urgent data pkts:          0 pkts      urgent data pkts:          0 pkts
      urgent data bytes:         0 bytes     urgent data bytes:         0 bytes
      mss requested:          1460 bytes     mss requested:          1460 bytes
      max segm size:          8688 bytes     max segm size:             0 bytes
      min segm size:             8 bytes     min segm size:             0 bytes
      avg segm size:          1687 bytes     avg segm size:             0 bytes
      max win adv:            5888 bytes     max win adv:          968704 bytes
      min win adv:            5888 bytes     min win adv:            8704 bytes
      zero win adv:              0 times     zero win adv:              0 times
      avg win adv:            5888 bytes     avg win adv:          798088 bytes
      initial window:         2896 bytes     initial window:            0 bytes
      initial window:            2 pkts      initial window:            0 pkts
      ttl stream length: 1577454360 bytes     ttl stream length:         0 bytes
      missed data:       397030448 bytes     missed data:               0 bytes
      truncated data:    1159600134 bytes     truncated data:            0 bytes
      truncated packets:    699383 pkts      truncated packets:         0 pkts
      data xmit time:       10.005 secs      data xmit time:        0.000 secs
      idletime max:            7.5 ms        idletime max:            7.4 ms
      throughput:        117979555 Bps       throughput:                0 Bps

This was taken at the receiving 10G NIC.

rick jones
