* error(s) in 2.6.23-rc5 bonding.txt ?
From: Rick Jones @ 2007-09-07 22:02 UTC (permalink / raw)
To: Linux Network Development list
I was perusing Documentation/networking/bonding.txt in a 2.6.23-rc5 tree
and came across the following passage discussing round-robin scheduling:
> Note that this out of order delivery occurs when both the
> sending and receiving systems are utilizing a multiple
> interface bond. Consider a configuration in which a
> balance-rr bond feeds into a single higher capacity network
> channel (e.g., multiple 100Mb/sec ethernets feeding a single
> gigabit ethernet via an etherchannel capable switch). In this
> configuration, traffic sent from the multiple 100Mb devices to
> a destination connected to the gigabit device will not see
> packets out of order.
My first reaction was that this was incorrect - it didn't matter if the
receiver was using a single link or not because the packets flowing
across the multiple 100Mb links could hit the intermediate device out of
order and so stay that way across the GbE link.
Before I go and patch out that text, I thought I'd double-check.
rick jones
* Re: error(s) in 2.6.23-rc5 bonding.txt ?
From: Jay Vosburgh @ 2007-09-07 23:31 UTC (permalink / raw)
To: Rick Jones; +Cc: Linux Network Development list

Rick Jones <rick.jones2@hp.com> wrote:
[...]
>> Note that this out of order delivery occurs when both the
>> sending and receiving systems are utilizing a multiple
>> interface bond. Consider a configuration in which a
>> balance-rr bond feeds into a single higher capacity network
>> channel (e.g., multiple 100Mb/sec ethernets feeding a single
>> gigabit ethernet via an etherchannel capable switch). In this
>> configuration, traffic sent from the multiple 100Mb devices to
>> a destination connected to the gigabit device will not see
>> packets out of order.
>
> My first reaction was that this was incorrect - it didn't matter if the
> receiver was using a single link or not because the packets flowing across
> the multiple 100Mb links could hit the intermediate device out of order
> and so stay that way across the GbE link.

Usually it does matter, at least at the time I tested this. Usually,
the even striping of traffic from the balance-rr mode will deliver in
order to a single higher speed link (e.g., N 100Mb feeding a single
1Gb).

I say "usually" because, although I don't see it happen with the
equipment I have, I'm willing to believe that there are gizmos that
would "bundle" packets arriving on the switch ports.

The reordering (usually) occurs when packet coalescing (either
interrupt mitigation on the device, or NAPI) happens at the receiver
end, after the packets are striped evenly into the interfaces, e.g.,

	eth0	eth1	eth2
	P1	P2	P3
	P4	P5	P6
	P7	P8	P9

and then eth0 goes and grabs a bunch of its packets, then eth1 and eth2
do the same afterwards, so the received order ends up something like
P1, P4, P7, P2, P5, P8, P3, P6, P9.

In Ye Olde Dayes Of Yore, with one packet per interrupt at 10 Mb/sec,
this type of configuration wouldn't reorder (or at least not as badly).

The text probably is lacking in some detail, though. The real key is
that the last sender before the destination system has to do the
round-robin striping. Most switches that I'm familiar with don't have
round-robin as a load balance option for etherchannel (again, I've
never seen one that does, but I'm willing to believe one exists), and
thus won't evenly stripe traffic, but instead do some math on the
packets so that a given "connection" isn't split across ports.

That said, it's certainly plausible that, for a given set of N
ethernets all enslaved to a single bonding balance-rr, the individual
ethernets could get out of sync, as it were (e.g., one running a fuller
tx ring, and thus running "behind" the others). If bonding is the only
feeder of the devices, then for a continuous flow of traffic, all the
slaves will generally receive packets (from the kernel, for
transmission) at pretty much the same rate, and so they won't tend to
get ahead or behind.

I haven't investigated this deeply for a few years, but this is my
recollection of what happened with the tests I did then. I did testing
with multiple 100Mb devices feeding either other sets of 100Mb devices
or single gigabit devices.
I'm willing to believe that things have changed, and an N feeding into
one configuration can reorder, but I haven't seen it (or really looked
for it; balance-rr isn't much the rage these days).

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
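
To make the coalescing effect described above concrete, here is a toy
sketch (not from the thread, and not real driver code); "budget" simply
stands in for how many packets an interface hands up per interrupt or
NAPI poll:

    from collections import deque

    def stripe_round_robin(packets, num_slaves):
        # balance-rr style: one packet per slave interface, in turn
        queues = [deque() for _ in range(num_slaves)]
        for i, pkt in enumerate(packets):
            queues[i % num_slaves].append(pkt)
        return queues

    def receive_coalesced(queues, budget):
        # drain up to `budget` packets from each interface per poll,
        # cycling over the interfaces until everything has been received
        received = []
        while any(queues):
            for q in queues:
                for _ in range(min(budget, len(q))):
                    received.append(q.popleft())
        return received

    packets = ["P%d" % i for i in range(1, 10)]              # P1..P9
    print(receive_coalesced(stripe_round_robin(packets, 3), budget=1))
    # one packet per interrupt: delivered in order, as in the old 10Mb days
    print(receive_coalesced(stripe_round_robin(packets, 3), budget=3))
    # ['P1', 'P4', 'P7', 'P2', 'P5', 'P8', 'P3', 'P6', 'P9'] -- reordered

With budget=1 the model reproduces the in-order delivery of the
one-packet-per-interrupt era; with a larger budget it reproduces the
P1, P4, P7, ... order described above.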
* Re: error(s) in 2.6.23-rc5 bonding.txt ?
From: Rick Jones @ 2007-09-07 23:46 UTC (permalink / raw)
To: Jay Vosburgh; +Cc: Linux Network Development list

> That said, it's certainly plausible that, for a given set of N
> ethernets all enslaved to a single bonding balance-rr, the individual
> ethernets could get out of sync, as it were (e.g., one running a fuller
> tx ring, and thus running "behind" the others).

That is the scenario of which I was thinking.

> If bonding is the only feeder of the devices, then for a continuous
> flow of traffic, all the slaves will generally receive packets (from
> the kernel, for transmission) at pretty much the same rate, and so
> they won't tend to get ahead or behind.

I could see that if there were just one TCP connection doing bulk
transfer or something, but if a bulk transmitter were coupled with an
occasional request/response (i.e., a netperf TCP_STREAM alongside a
TCP_RR), I'd think the tx rings would no longer remain balanced.

> I haven't investigated this deeply for a few years, but this is my
> recollection of what happened with the tests I did then. I did testing
> with multiple 100Mb devices feeding either other sets of 100Mb devices
> or single gigabit devices. I'm willing to believe that things have
> changed, and an N feeding into one configuration can reorder, but I
> haven't seen it (or really looked for it; balance-rr isn't much the
> rage these days).

Are you OK with that block of text simply being yanked?

rick
* Re: error(s) in 2.6.23-rc5 bonding.txt ?
From: Jay Vosburgh @ 2007-09-08 1:01 UTC (permalink / raw)
To: Rick Jones; +Cc: Linux Network Development list

Rick Jones <rick.jones2@hp.com> wrote:
[...]
>> If bonding is the only feeder of the devices, then for a continuous
>> flow of traffic, all the slaves will generally receive packets (from
>> the kernel, for transmission) at pretty much the same rate, and so
>> they won't tend to get ahead or behind.
>
> I could see that if there were just one TCP connection doing bulk
> transfer or something, but if a bulk transmitter were coupled with an
> occasional request/response (i.e., a netperf TCP_STREAM alongside a
> TCP_RR), I'd think the tx rings would no longer remain balanced.

I'm not sure that would be the case, because even the traffic "bump"
from the TCP_RR would be funneled through the round-robin. So, the
next packet of the bulk transmit would simply be "pushed back" to the
next available interface.

Perhaps varying packet sizes would throw things out of whack, if the
small ones happened to line up all on one interface (regardless of the
other traffic).

A PAUSE frame to one interface would almost certainly get things out
of whack, but I don't know how long it would stay out of whack (or,
really, how likely getting a PAUSE is). Probably just as long as all
of the slaves are running at full speed.

>> I haven't investigated this deeply for a few years, but this is my
>> recollection of what happened with the tests I did then. I did testing
>> with multiple 100Mb devices feeding either other sets of 100Mb devices
>> or single gigabit devices. I'm willing to believe that things have
>> changed, and an N feeding into one configuration can reorder, but I
>> haven't seen it (or really looked for it; balance-rr isn't much the
>> rage these days).
>
> Are you OK with that block of text simply being yanked?

Mmm... I'm an easy sell for a "usually" or other suitable caveat added
in strategic places (avoiding absolute statements and all that). The
text does reflect the results of experiments I ran at the time, so I'm
reluctant to toss it wholesale simply because we speculate over how it
might not be accurate.

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
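
A similarly rough sketch (mine, not from the thread) of the
out-of-sync-slave scenario discussed above: if one slave's tx ring adds
a fixed extra delay, its share of the round-robin stream shows up late
at the aggregation point even though the striping itself is perfectly
even.

    def arrival_order(num_packets, num_slaves, per_packet_time, ring_delay):
        # ring_delay[i] is a fixed extra queuing delay for slave i's tx ring
        arrivals = []
        next_free = [0.0] * num_slaves
        for seq in range(num_packets):
            slave = seq % num_slaves
            next_free[slave] += per_packet_time   # slave sends its packets back to back
            arrivals.append((next_free[slave] + ring_delay[slave], seq))
        return [seq for _, seq in sorted(arrivals)]

    # All slaves in sync: sequence numbers arrive in order.
    print(arrival_order(9, 3, per_packet_time=1.0, ring_delay=[0.0, 0.0, 0.0]))
    # One slave running "behind" by 1.5 packet times: its packets slip.
    print(arrival_order(9, 3, per_packet_time=1.0, ring_delay=[0.0, 0.0, 1.5]))
    # [0, 1, 3, 4, 2, 6, 7, 5, 8]

Whether real tx rings ever stay skewed like this for long is exactly
the open question in the exchange above; the sketch only shows that a
constant skew is enough to reorder.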
* Re: error(s) in 2.6.23-rc5 bonding.txt ?
From: Rick Jones @ 2007-09-28 21:31 UTC (permalink / raw)
To: Jay Vosburgh; +Cc: Linux Network Development list

Well, I managed to concoct an updated test, this time with 1G's going
into a 10G: a 2.6.23-rc8 kernel on a system with four dual-port
82546GB's, connected to an HP ProCurve 3500 series switch with a 10G
link to a system running 2.6.18-8.el5 (I was having difficulty getting
cxgb3 going on my kernel.org kernels - firmware mismatches - so I
booted RHEL5 there).

I put all four 1G interfaces into a balance-rr (mode=0) bond and
started running just a single netperf TCP_STREAM test.

On the bonding side:

hpcpc103:~/net-2.6.24/Documentation/networking# netstat -s -t | grep retran
    19050 segments retransmited
    9349 fast retransmits
    9698 forward retransmits
hpcpc103:~/net-2.6.24/Documentation/networking# ifconfig bond0 | grep pack
          RX packets:50708119 errors:0 dropped:0 overruns:0 frame:0
          TX packets:58801285 errors:0 dropped:0 overruns:0 carrier:0
hpcpc103:~/net-2.6.24/Documentation/networking# netperf -H 192.168.5.106
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.5.106 (192.168.5.106) port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.01    1267.99
hpcpc103:~/net-2.6.24/Documentation/networking# netstat -s -t | grep retran
    20268 segments retransmited
    9974 fast retransmits
    10291 forward retransmits
hpcpc103:~/net-2.6.24/Documentation/networking# ifconfig bond0 | grep pack
          RX packets:51636421 errors:0 dropped:0 overruns:0 frame:0
          TX packets:59899089 errors:0 dropped:0 overruns:0 carrier:0

On the receiving side:

[root@hpcpc106 ~]# ifconfig eth5 | grep pack
          RX packets:58802455 errors:0 dropped:0 overruns:0 frame:0
          TX packets:50205304 errors:0 dropped:0 overruns:0 carrier:0
[root@hpcpc106 ~]# ifconfig eth5 | grep pack
          RX packets:59900267 errors:0 dropped:0 overruns:0 frame:0
          TX packets:51124138 errors:0 dropped:0 overruns:0 carrier:0

So, there were 20268 - 19050, or 1218, retransmissions during the test.
The sending side reported sending 59899089 - 58801285, or 1097804,
packets, and the receiver reported receiving 59900267 - 58802455, or
1097812, packets. Unless the switch was only occasionally duplicating
segments or something, it looks like all the retransmissions were the
result of duplicate ACKs from packet reordering.
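
The counter arithmetic above, spelled out (values copied from the
before/after output; the dictionary names are just for this note):

    before = {"retrans": 19050, "bond0_tx": 58801285, "eth5_rx": 58802455}
    after  = {"retrans": 20268, "bond0_tx": 59899089, "eth5_rx": 59900267}

    print(after["retrans"]  - before["retrans"])    # 1218 retransmissions
    print(after["bond0_tx"] - before["bond0_tx"])   # 1097804 packets sent on bond0
    print(after["eth5_rx"]  - before["eth5_rx"])    # 1097812 packets received on eth5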
For grins I varied the "reordering" sysctl and got:

# netstat -s -t | grep retran; for i in 3 4 5 6 7 8 9 10 20 30; do sysctl -w net.ipv4.tcp_reordering=$i; netperf -H 192.168.5.106 -P 0 -B "reorder $i"; netstat -s -t | grep retran; done
    13735 segments retransmited
    6581 fast retransmits
    7151 forward retransmits
net.ipv4.tcp_reordering = 3
 87380  16384  16384    10.01    1294.51   reorder 3
    15127 segments retransmited
    7330 fast retransmits
    7794 forward retransmits
net.ipv4.tcp_reordering = 4
 87380  16384  16384    10.01    1304.22   reorder 4
    16103 segments retransmited
    7807 fast retransmits
    8293 forward retransmits
net.ipv4.tcp_reordering = 5
 87380  16384  16384    10.01    1330.88   reorder 5
    16763 segments retransmited
    8155 fast retransmits
    8605 forward retransmits
net.ipv4.tcp_reordering = 6
 87380  16384  16384    10.01    1350.50   reorder 6
    17134 segments retransmited
    8356 fast retransmits
    8775 forward retransmits
net.ipv4.tcp_reordering = 7
 87380  16384  16384    10.01    1353.00   reorder 7
    17492 segments retransmited
    8553 fast retransmits
    8936 forward retransmits
net.ipv4.tcp_reordering = 8
 87380  16384  16384    10.01    1358.00   reorder 8
    17649 segments retransmited
    8625 fast retransmits
    9021 forward retransmits
net.ipv4.tcp_reordering = 9
 87380  16384  16384    10.01    1415.89   reorder 9
    17736 segments retransmited
    8666 fast retransmits
    9067 forward retransmits
net.ipv4.tcp_reordering = 10
 87380  16384  16384    10.01    1412.36   reorder 10
    17773 segments retransmited
    8684 fast retransmits
    9086 forward retransmits
net.ipv4.tcp_reordering = 20
 87380  16384  16384    10.01    1403.47   reorder 20
    17773 segments retransmited
    8684 fast retransmits
    9086 forward retransmits
net.ipv4.tcp_reordering = 30
 87380  16384  16384    10.01    1325.41   reorder 30
    17773 segments retransmited
    8684 fast retransmits
    9086 forward retransmits

I.e., fast retransmits from reordering until the reorder limit was
reasonably well above the number of links in the aggregate.

As for how things got reordered, Knuth knows exactly why. But it didn't
need more than one connection, and that connection didn't have to vary
the size of what it was passing to send(). Netperf was not making send
calls that were an integral multiple of the MSS, which means that from
time to time a short segment would be queued to an interface in the
bond. Also, two of the dual-port NICs were on 66 MHz PCI-X busses and
the other two were on 133 MHz PCI-X busses (four busses in all), so the
DMA times will have differed.
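
A much-simplified sketch (mine, not real TCP) of the knob being turned
above: a segment that is merely reordered, not lost, triggers a
spurious fast retransmit only if at least tcp_reordering later segments
arrive ahead of it and generate that many duplicate ACKs. Once the
threshold exceeds the typical displacement (evidently somewhat above
the four links in this test), the spurious retransmits disappear. The
receive order below is just the batched-receive pattern for four links,
used as an example.

    def spurious_fast_retransmits(received_order, threshold):
        # Count segments a sender would fast-retransmit needlessly, given a
        # receive order that is merely reordered (nothing actually lost).
        delivered = set()
        next_expected = 0
        dupacks = 0
        retransmits = 0
        for seq in received_order:
            delivered.add(seq)
            if seq == next_expected:
                dupacks = 0
                while next_expected in delivered:   # cumulative ACK advances
                    next_expected += 1
            else:
                dupacks += 1                        # out-of-order arrival -> dup ACK
                if dupacks == threshold:
                    retransmits += 1                # spurious fast retransmit
        return retransmits

    # Coalesced receive order for 4 links with 4-packet batches:
    order = [0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15]
    for threshold in (3, 4, 5):
        print(threshold, spurious_fast_retransmits(order, threshold))
    # prints: 3 3, 4 0, 5 0

In the real run the retransmits tapered off rather than stopping
abruptly, presumably because the actual displacement varied from batch
to batch rather than being fixed.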
And as if this mail wasn't already long enough, here is some tcptrace
summary for the netperf data connection with reorder at 3:

================================
TCP connection 2:
	host c:        192.168.5.103:52264
	host d:        192.168.5.106:33940
	complete conn: yes
	first packet:  Fri Sep 28 14:06:43.271692 2007
	last packet:   Fri Sep 28 14:06:53.277018 2007
	elapsed time:  0:00:10.005326
	total packets: 1556191
	filename:      trace
   c->d:                               d->c:
     total packets:        699400        total packets:        856791
     ack pkts sent:        699399        ack pkts sent:        856791
     pure acks sent:            2        pure acks sent:       856789
     sack pkts sent:            0        sack pkts sent:       352480
     dsack pkts sent:           0        dsack pkts sent:         948
     max sack blks/ack:         0        max sack blks/ack:         3
     unique bytes sent: 1180423912       unique bytes sent:         0
     actual data pkts:     699397        actual data pkts:          0
     actual data bytes: 1180581744       actual data bytes:         0
     rexmt data pkts:         106        rexmt data pkts:           0
     rexmt data bytes:     157832        rexmt data bytes:          0
     zwnd probe pkts:           0        zwnd probe pkts:           0
     zwnd probe bytes:          0        zwnd probe bytes:          0
     outoforder pkts:      202461        outoforder pkts:           0
     pushed data pkts:       6057        pushed data pkts:          0
     SYN/FIN pkts sent:       1/1        SYN/FIN pkts sent:       1/1
     req 1323 ws/ts:          Y/Y        req 1323 ws/ts:          Y/Y
     adv wind scale:            7        adv wind scale:            9
     req sack:                  Y        req sack:                  Y
     sacks sent:                0        sacks sent:           352480
     urgent data pkts:          0 pkts   urgent data pkts:          0 pkts
     urgent data bytes:         0 bytes  urgent data bytes:         0 bytes
     mss requested:          1460 bytes  mss requested:          1460 bytes
     max segm size:          8688 bytes  max segm size:             0 bytes
     min segm size:             8 bytes  min segm size:             0 bytes
     avg segm size:          1687 bytes  avg segm size:             0 bytes
     max win adv:            5888 bytes  max win adv:          968704 bytes
     min win adv:            5888 bytes  min win adv:            8704 bytes
     zero win adv:              0 times  zero win adv:              0 times
     avg win adv:            5888 bytes  avg win adv:          798088 bytes
     initial window:         2896 bytes  initial window:            0 bytes
     initial window:            2 pkts   initial window:            0 pkts
     ttl stream length: 1577454360 bytes ttl stream length:         0 bytes
     missed data:       397030448 bytes  missed data:               0 bytes
     truncated data:    1159600134 bytes truncated data:            0 bytes
     truncated packets:    699383 pkts   truncated packets:         0 pkts
     data xmit time:       10.005 secs   data xmit time:        0.000 secs
     idletime max:            7.5 ms     idletime max:            7.4 ms
     throughput:        117979555 Bps    throughput:                0 Bps

This was taken at the receiving 10G NIC.

rick jones
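
Putting the key tcptrace counters above in proportion (numbers copied
from the c->d column):

    data_pkts   = 699397       # actual data pkts
    ooo_pkts    = 202461       # outoforder pkts
    unique      = 1180423912   # unique bytes sent
    rexmt_bytes = 157832       # rexmt data bytes

    print("%.1f%% of data packets arrived out of order" % (100.0 * ooo_pkts / data_pkts))
    print("%.3f%% of the bytes were retransmitted" % (100.0 * rexmt_bytes / unique))

Roughly 29% of the data packets arrived out of order, yet only about
0.01% of the bytes were retransmitted.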
* Re: error(s) in 2.6.23-rc5 bonding.txt ?
From: Bill Fink @ 2007-09-08 6:05 UTC (permalink / raw)
To: Jay Vosburgh; +Cc: Rick Jones, Linux Network Development list

On Fri, 07 Sep 2007, Jay Vosburgh wrote:

> Rick Jones <rick.jones2@hp.com> wrote:
> [...]
> >> Note that this out of order delivery occurs when both the
> >> sending and receiving systems are utilizing a multiple
> >> interface bond. Consider a configuration in which a
> >> balance-rr bond feeds into a single higher capacity network
> >> channel (e.g., multiple 100Mb/sec ethernets feeding a single
> >> gigabit ethernet via an etherchannel capable switch). In this
> >> configuration, traffic sent from the multiple 100Mb devices to
> >> a destination connected to the gigabit device will not see
> >> packets out of order.

I would just change the last part of the last sentence to:

	configuration, traffic sent from the multiple 100Mb devices to
	a destination connected to the gigabit device will not usually
	see packets out of order in the absence of congestion on the
	outgoing gigabit ethernet interface.

If there was momentary congestion on the outgoing gigabit ethernet
interface, I suppose it would be possible to get out of order delivery
if some of the incoming packets on the striped round-robin interfaces
had to be buffered a short while before delivery was possible.

> The text probably is lacking in some detail, though. The real key is
> that the last sender before the destination system has to do the
> round-robin striping. Most switches that I'm familiar with don't have
> round-robin as a load balance option for etherchannel (again, I've
> never seen one that does, but I'm willing to believe one exists), and
> thus won't evenly stripe traffic, but instead do some math on the
> packets so that a given "connection" isn't split across ports.

Just FYI, the "i" series of Extreme switches (and some of their other
switches) support round-robin load balancing. We consider this an
extremely useful feature (but only use it for switch-to-switch link
aggregation), and bemoan its lack in their newer switch offerings.

Force10 switches also have a pseudo round-robin load balancing
capability, which they call packet-based, which works by distributing
packets based on the IP Identification field (and only works for
IPv4). One downside of the Force10 feature is that it is a global
setting and thus affects all link aggregated links (the Extreme
feature can be set on a per link aggregated link basis).

						-Bill
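
The "packet-based" distribution Bill mentions isn't specified in any
more detail here, but the idea can be sketched as follows (my guess at
the mechanism, not vendor documentation): pick the member port from the
IP Identification field, which degenerates to round-robin when the
sender increments the IP ID by one per packet.

    def select_member_port(ip_id, num_ports):
        # pseudo round-robin: distribute by IP ID rather than by a flow hash
        return ip_id % num_ports

    # a sender that bumps the IP ID per packet spreads evenly over 4 ports
    for ip_id in range(100, 108):
        print(ip_id, "->", select_member_port(ip_id, num_ports=4))

Because the IP ID is per-packet rather than per-flow, a single
connection's packets get split across ports, which is why this kind of
mode can reorder where an ordinary flow-hash etherchannel will not.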