* error(s) in 2.6.23-rc5 bonding.txt ?
From: Rick Jones @ 2007-09-07 22:02 UTC
To: Linux Network Development list
I was perusing Documentation/networking/bonding.txt in a 2.6.23-rc5 tree
and came across the following passage discussing the round-robin scheduling:
> Note that this out of order delivery occurs when both the
> sending and receiving systems are utilizing a multiple
> interface bond. Consider a configuration in which a
> balance-rr bond feeds into a single higher capacity network
> channel (e.g., multiple 100Mb/sec ethernets feeding a single
> gigabit ethernet via an etherchannel capable switch). In this
> configuration, traffic sent from the multiple 100Mb devices to
> a destination connected to the gigabit device will not see
> packets out of order.
My first reaction was that this was incorrect - it didn't matter if the
receiver was using a single link or not because the packets flowing
across the multiple 100Mb links could hit the intermediate device out of
order and so stay that way across the GbE link.
Before I go and patch out that text, I thought I'd double-check.
rick jones
* Re: error(s) in 2.6.23-rc5 bonding.txt ?
From: Jay Vosburgh @ 2007-09-07 23:31 UTC
To: Rick Jones; +Cc: Linux Network Development list
Rick Jones <rick.jones2@hp.com> wrote:
[...]
>> Note that this out of order delivery occurs when both the
>> sending and receiving systems are utilizing a multiple
>> interface bond. Consider a configuration in which a
>> balance-rr bond feeds into a single higher capacity network
>> channel (e.g., multiple 100Mb/sec ethernets feeding a single
>> gigabit ethernet via an etherchannel capable switch). In this
>> configuration, traffic sent from the multiple 100Mb devices to
>> a destination connected to the gigabit device will not see
>> packets out of order.
>
>My first reaction was that this was incorrect - it didn't matter if the
>receiver was using a single link or not because the packets flowing across
>the multiple 100Mb links could hit the intermediate device out of order
>and so stay that way across the GbE link.
Usually it does matter, or at least it did at the time I tested this.
Usually, the even striping of traffic from the balance-rr mode
will deliver in-order to a single higher speed link (e.g., N 100Mb
feeding a single 1Gb). I say "usually" because, although I don't see it
happen with the equipment I have, I'm willing to believe that there are
gizmos that would "bundle" packets arriving on the switch ports.
The reordering (usually) occurs when packet coalescing stuff
(either interrupt mitigation on the device, or NAPI) happens at the
receiver end, after the packets are striped evenly into the interfaces,
e.g.,
eth0 eth1 eth2
P1 P2 P3
P4 P5 P6
P7 P8 P9
and then eth0 goes and grabs a bunch of its packets, then eth1,
and eth2 do the same afterwards, so the received order ends up something
like P1, P4, P7, P2, P5, P8, P3, P6, P9. In Ye Olde Dayes Of Yore, with
one packet per interrupt at 10 Mb/sec, this type of configuration
wouldn't reorder (or at least not as badly).
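To make the batch-draining effect concrete, here is a minimal Python sketch of
the scenario above (purely illustrative toy code, not the bonding driver or
NAPI itself):

# Packets are striped round-robin across three receive NICs, but each
# NIC is then drained in one batch (interrupt mitigation / NAPI style)
# rather than one packet at a time.
PACKETS = [f"P{i}" for i in range(1, 10)]          # P1 .. P9
NUM_NICS = 3

# Sender stripes evenly: NIC 0 gets P1,P4,P7; NIC 1 gets P2,P5,P8; ...
queues = [PACKETS[n::NUM_NICS] for n in range(NUM_NICS)]

# Receiver services one NIC at a time, taking its whole backlog per poll.
delivered = [pkt for queue in queues for pkt in queue]
print(delivered)   # ['P1', 'P4', 'P7', 'P2', 'P5', 'P8', 'P3', 'P6', 'P9']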
The text probably is lacking in some detail, though. The real
key is that the last sender before getting to the destination system has
to do the round-robin striping. Most switches that I'm familiar with
don't have round-robin as a load balance option for etherchannel (again,
I've never seen one that does, but I'm willing to believe one exists),
and thus won't evenly stripe traffic; instead they do some math on the
packets so that a given "connection" isn't split across ports.
That said, it's certainly plausible that, for a given set of N
ethernets all enslaved to a single bonding balance-rr, the individual
ethernets could get out of sync, as it were (e.g., one running a fuller
tx ring, and thus running "behind" the others). If bonding is the only
feeder of the devices, then for a continuous flow of traffic, all the
slaves will generally receive packets (from the kernel, for
transmission) at pretty much the same rate, and so they won't tend to
get ahead or behind.
I haven't investigated this deeply for a few years, but
this is my recollection of what happened with the tests I did then. I
did testing with multiple 100Mb devices feeding either other sets of
100Mb devices or single gigabit devices. I'm willing to believe that
things have changed, and an N feeding into one configuration can
reorder, but I haven't seen it (or really looked for it; balance-rr
isn't much the rage these days).
-J
---
-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
* Re: error(s) in 2.6.23-rc5 bonding.txt ?
From: Rick Jones @ 2007-09-07 23:46 UTC
To: Jay Vosburgh; +Cc: Linux Network Development list
> That said, it's certainly plausible that, for a given set of N
> ethernets all enslaved to a single bonding balance-rr, the individual
> ethernets could get out of sync, as it were (e.g., one running a fuller
> tx ring, and thus running "behind" the others).
That is the scenario of which I was thinking.
> If bonding is the only feeder of the devices, then for a continuous
> flow of traffic, all the slaves will generally receive packets (from
> the kernel, for transmission) at pretty much the same rate, and so
> they won't tend to get ahead or behind.
I could see that if there were just one TCP connection doing bulk
transfer or something, but if there were a bulk transmitter coupled with
an occasional request/response (i.e., a netperf TCP_STREAM alongside a
TCP_RR) I'd think the tx rings would no longer remain balanced.
> I haven't investigated this deeply for a few years, but
> this is my recollection of what happened with the tests I did then. I
> did testing with multiple 100Mb devices feeding either other sets of
> 100Mb devices or single gigabit devices. I'm willing to believe that
> things have changed, and an N feeding into one configuration can
> reorder, but I haven't seen it (or really looked for it; balance-rr
> isn't much the rage these days).
Are you OK with that block of text simply being yanked?
rick
* Re: error(s) in 2.6.23-rc5 bonding.txt ?
From: Jay Vosburgh @ 2007-09-08 1:01 UTC
To: Rick Jones; +Cc: Linux Network Development list
Rick Jones <rick.jones2@hp.com> wrote:
[...]
>> If bonding is the only feeder of the devices, then for a continuous
>> flow of traffic, all the slaves will generally receive packets (from
>> the kernel, for transmission) at pretty much the same rate, and so
>> they won't tend to get ahead or behind.
>
>I could see that if there were just one TCP connection doing bulk transfer
>or something, but if there were a bulk transmitter coupled with an occasional
>request/response (i.e., a netperf TCP_STREAM alongside a TCP_RR) I'd think
>the tx rings would no longer remain balanced.
I'm not sure that would be the case, because even the traffic
"bump" from the TCP_RR would be funneled through the round-robin. So,
the next packet of the bulk transmit would simply be "pushed back" to
the next available interface.
Perhaps varying packet sizes would throw things out of whack, if
the small ones happened to line up all on one interface (regardless of
the other traffic).
A PAUSE frame to one interface would almost certainly get things
out of whack, but I don't know how long it would stay out of whack (or,
really, how likely getting a PAUSE is). Probably just as long as all of
the slaves are running at full speed.
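All of these "one slave falls behind" scenarios come down to the same picture;
here is a toy sketch, with invented delays, of what the wire order could look
like if one slave carries a deeper tx backlog than its peers:

# Packets are striped round-robin across three slaves, but eth1 starts
# with a two-slot backlog, so each of its packets hits the wire later
# than the corresponding packets on eth0/eth2.
NUM_SLAVES = 3
EXTRA_BACKLOG = {0: 0, 1: 2, 2: 0}                 # slave 1 is two slots behind

packets = list(range(1, 13))                       # P1 .. P12
departures = []
for i, pkt in enumerate(packets):
    slave = i % NUM_SLAVES                         # round-robin striping
    slot = i // NUM_SLAVES + EXTRA_BACKLOG[slave]  # when it hits the wire
    departures.append((slot, slave, pkt))

# Sort by departure slot (ties broken arbitrarily by slave index).
wire_order = [f"P{p}" for _, _, p in sorted(departures)]
print(wire_order)
# ['P1', 'P3', 'P4', 'P6', 'P7', 'P2', 'P9', 'P10', 'P5', 'P12', 'P8', 'P11']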
>> I haven't investigated this deeply for a few years, but
>> this is my recollection of what happened with the tests I did then. I
>> did testing with multiple 100Mb devices feeding either other sets of
>> 100Mb devices or single gigabit devices. I'm willing to believe that
>> things have changed, and an N feeding into one configuration can
>> reorder, but I haven't seen it (or really looked for it; balance-rr
>> isn't much the rage these days).
>
>Are you OK with that block of text simply being yanked?
Mmm... I'm an easy sell for a "usually" or other suitable caveat
added in strategic places (avoiding absolute statements and all that).
The text does reflect the results of experiments I ran at the time, so
I'm reluctant to toss it wholesale simply because we speculate over how
it might not be accurate.
-J
---
-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
* Re: error(s) in 2.6.23-rc5 bonding.txt ?
From: Bill Fink @ 2007-09-08 6:05 UTC
To: Jay Vosburgh; +Cc: Rick Jones, Linux Network Development list
On Fri, 07 Sep 2007, Jay Vosburgh wrote:
> Rick Jones <rick.jones2@hp.com> wrote:
> [...]
> >> Note that this out of order delivery occurs when both the
> >> sending and receiving systems are utilizing a multiple
> >> interface bond. Consider a configuration in which a
> >> balance-rr bond feeds into a single higher capacity network
> >> channel (e.g., multiple 100Mb/sec ethernets feeding a single
> >> gigabit ethernet via an etherchannel capable switch). In this
> >> configuration, traffic sent from the multiple 100Mb devices to
> >> a destination connected to the gigabit device will not see
> >> packets out of order.
I would just change the last part of the last sentence to:
configuration, traffic sent from the multiple 100Mb devices to
a destination connected to the gigabit device will not usually
see packets out of order in the absence of congestion on the
outgoing gigabit ethernet interface.
If there was momentary congestion on the outgoing gigabit ethernet
interface, I suppose it would be possible to get out of order delivery
if some of the incoming packets on the striped round-robin interfaces
had to be buffered a short while before delivery was possible.
> The text probably is lacking in some detail, though. The real
> key is that the last sender before getting to the destination system has
> to do the round-robin striping. Most switches that I'm familiar with
> don't have round-robin as a load balance option for etherchannel (again,
> I've never seen one that does, but I'm willing to believe one exists),
> and thus won't evenly stripe traffic; instead they do some math on the
> packets so that a given "connection" isn't split across ports.
Just FYI, the "i" series of Extreme switches (and some of their other
switches) support round-robin load balancing. We consider this an
extremely useful feature (but only use it for switch-to-switch link
aggregation), and bemoan its lack in their newer switch offerings.
Force10 switches also have a pseudo round-robin load balancing capability
which they call packet-based, which works by distributing packets based
on the IP Identification field (and only works for IPv4). One downside
of the Force10 feature is that it is a global setting and thus affects
all link aggregated links (the Extreme feature can be set on a per
link aggregated link basis).
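One plausible reading of that packet-based policy, sketched below (hypothetical
Python, not vendor code), is that a per-packet incrementing IPv4 Identification
field effectively round-robins consecutive packets of a flow across the member
ports:

# Pick the egress member port from the IPv4 Identification field; since
# the ID normally increments per packet, a single flow gets sprayed
# across the ports much like round-robin striping.
def member_port(ip_id: int, num_ports: int) -> int:
    return ip_id % num_ports

ip_ids = range(1000, 1008)                       # IDs incrementing per packet
print([member_port(i, 4) for i in ip_ids])       # [0, 1, 2, 3, 0, 1, 2, 3]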
-Bill
* Re: error(s) in 2.6.23-rc5 bonding.txt ?
From: Rick Jones @ 2007-09-28 21:31 UTC
To: Jay Vosburgh; +Cc: Linux Network Development list
Well, I managed to concoct an updated test, this time with 1G links going into
a 10G. A 2.6.23-rc8 kernel on the system with four dual-port 82546GBs,
connected to an HP ProCurve 3500 series switch with a 10G link to a system
running 2.6.18-8.el5 (I was having difficulty getting cxgb3 going on my
kernel.org kernels - firmware mismatches - so I booted RHEL5 there).
I put all four 1G interfaces into a balance-rr (mode=0) bond and started
running just a single netperf TCP_STREAM test.
On the bonding side:
hpcpc103:~/net-2.6.24/Documentation/networking# netstat -s -t | grep retran
19050 segments retransmited
9349 fast retransmits
9698 forward retransmits
hpcpc103:~/net-2.6.24/Documentation/networking# ifconfig bond0 | grep pack
RX packets:50708119 errors:0 dropped:0 overruns:0 frame:0
TX packets:58801285 errors:0 dropped:0 overruns:0 carrier:0
hpcpc103:~/net-2.6.24/Documentation/networking# netperf -H 192.168.5.106
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.5.106
(192.168.5.106) port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.01    1267.99
hpcpc103:~/net-2.6.24/Documentation/networking# netstat -s -t | grep retran
20268 segments retransmited
9974 fast retransmits
10291 forward retransmits
hpcpc103:~/net-2.6.24/Documentation/networking# ifconfig bond0 | grep pack
RX packets:51636421 errors:0 dropped:0 overruns:0 frame:0
TX packets:59899089 errors:0 dropped:0 overruns:0 carrier:0
On the receiving side:
[root@hpcpc106 ~]# ifconfig eth5 | grep pack
RX packets:58802455 errors:0 dropped:0 overruns:0 frame:0
TX packets:50205304 errors:0 dropped:0 overruns:0 carrier:0
[root@hpcpc106 ~]# ifconfig eth5 | grep pack
RX packets:59900267 errors:0 dropped:0 overruns:0 frame:0
TX packets:51124138 errors:0 dropped:0 overruns:0 carrier:0
So, there were 20268 - 19050 or 1218 retransmissions during the test. The
sending side reported sending 59899089 - 58801285 or 1097804 packets, and the
receiver reported receiving 59900267 - 58802455 or 1097812 packets.
Unless the switch was only occasionally duplicating segments or something, it
looks like all the retransmissions were the result of duplicate ACKs from packet
reordering.
For grins I varied the "reordering" sysctl and got:
# netstat -s -t | grep retran; for i in 3 4 5 6 7 8 9 10 20 30; do sysctl -w
net.ipv4.tcp_reordering=$i; netperf -H 192.168.5.106 -P 0 -B "reorder $i";
netstat -s -t | grep retran; done
13735 segments retransmited
6581 fast retransmits
7151 forward retransmits
net.ipv4.tcp_reordering = 3
87380 16384 16384 10.01 1294.51 reorder 3
15127 segments retransmited
7330 fast retransmits
7794 forward retransmits
net.ipv4.tcp_reordering = 4
87380 16384 16384 10.01 1304.22 reorder 4
16103 segments retransmited
7807 fast retransmits
8293 forward retransmits
net.ipv4.tcp_reordering = 5
87380 16384 16384 10.01 1330.88 reorder 5
16763 segments retransmited
8155 fast retransmits
8605 forward retransmits
net.ipv4.tcp_reordering = 6
87380 16384 16384 10.01 1350.50 reorder 6
17134 segments retransmited
8356 fast retransmits
8775 forward retransmits
net.ipv4.tcp_reordering = 7
87380 16384 16384 10.01 1353.00 reorder 7
17492 segments retransmited
8553 fast retransmits
8936 forward retransmits
net.ipv4.tcp_reordering = 8
87380 16384 16384 10.01 1358.00 reorder 8
17649 segments retransmited
8625 fast retransmits
9021 forward retransmits
net.ipv4.tcp_reordering = 9
87380 16384 16384 10.01 1415.89 reorder 9
17736 segments retransmited
8666 fast retransmits
9067 forward retransmits
net.ipv4.tcp_reordering = 10
87380 16384 16384 10.01 1412.36 reorder 10
17773 segments retransmited
8684 fast retransmits
9086 forward retransmits
net.ipv4.tcp_reordering = 20
87380 16384 16384 10.01 1403.47 reorder 20
17773 segments retransmited
8684 fast retransmits
9086 forward retransmits
net.ipv4.tcp_reordering = 30
87380 16384 16384 10.01 1325.41 reorder 30
17773 segments retransmited
8684 fast retransmits
9086 forward retransmits
I.e., fast retransmits from reordering continued until the reorder limit was
reasonably well above the number of links in the aggregate.
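A back-of-the-envelope model of why that is (slave count, batch size and the
resulting order are invented for illustration, not measurements from this
setup):

# With N links striped round-robin and each NIC drained a batch at a
# time, a packet can show up several positions later than it was sent,
# and every early packet behind it can generate a duplicate ACK.
NUM_NICS, BATCH = 4, 4
sent = list(range(NUM_NICS * BATCH * 3))             # toy sequence numbers
queues = [sent[n::NUM_NICS] for n in range(NUM_NICS)]  # round-robin stripe

received = []
while any(queues):                                    # drain BATCH per NIC per pass
    for q in queues:
        received.extend(q[:BATCH])
        del q[:BATCH]

# Worst-case displacement, roughly the dup ACKs a late packet provokes.
print(max(received.index(s) - i for i, s in enumerate(sent)))
# 9 with these toy numbers: plenty to fool tcp_reordering=3, but not a
# limit comfortably above the number of links.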
As for how things got reordered, Knuth knows exactly why. But it didn't need
more than one connection, and that connection didn't have to vary the size of
what it was passing to send(). Netperf was not making send calls which were an
integral multiple of the MSS, which means that from time to time a short segment
would be queued to an interface in the bond. Also, two of the dual-port NICs
were on 66 MHz PCI-X busses, and the other two were on 133 MHz PCI-X busses
(four busses in all) so the DMA times will have differed.
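For what it's worth, the arithmetic behind the short-segment remark, using
example MSS values and ignoring TCP's coalescing of sends in the socket
buffer:

# The netperf send size above is 16384 bytes; it is not an integral
# multiple of typical ethernet MSS values, so a sub-MSS tail segment can
# end up queued to one of the slaves now and then.
SEND_SIZE = 16384
for mss in (1460, 8688):                 # example MSS values only
    full, tail = divmod(SEND_SIZE, mss)
    print(f"MSS {mss}: {full} full segment(s) + {tail}-byte tail")
# MSS 1460: 11 full segment(s) + 324-byte tail
# MSS 8688: 1 full segment(s) + 7696-byte tail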
And as if this mail wasn't already long enough, here is some tcptrace summary
for the netperf data connection with reorder at 3:
================================
TCP connection 2:
host c: 192.168.5.103:52264
host d: 192.168.5.106:33940
complete conn: yes
first packet: Fri Sep 28 14:06:43.271692 2007
last packet: Fri Sep 28 14:06:53.277018 2007
elapsed time: 0:00:10.005326
total packets: 1556191
filename: trace
c->d: d->c:
total packets: 699400 total packets: 856791
ack pkts sent: 699399 ack pkts sent: 856791
pure acks sent: 2 pure acks sent: 856789
sack pkts sent: 0 sack pkts sent: 352480
dsack pkts sent: 0 dsack pkts sent: 948
max sack blks/ack: 0 max sack blks/ack: 3
unique bytes sent: 1180423912 unique bytes sent: 0
actual data pkts: 699397 actual data pkts: 0
actual data bytes: 1180581744 actual data bytes: 0
rexmt data pkts: 106 rexmt data pkts: 0
rexmt data bytes: 157832 rexmt data bytes: 0
zwnd probe pkts: 0 zwnd probe pkts: 0
zwnd probe bytes: 0 zwnd probe bytes: 0
outoforder pkts: 202461 outoforder pkts: 0
pushed data pkts: 6057 pushed data pkts: 0
SYN/FIN pkts sent: 1/1 SYN/FIN pkts sent: 1/1
req 1323 ws/ts: Y/Y req 1323 ws/ts: Y/Y
adv wind scale: 7 adv wind scale: 9
req sack: Y req sack: Y
sacks sent: 0 sacks sent: 352480
urgent data pkts: 0 pkts urgent data pkts: 0 pkts
urgent data bytes: 0 bytes urgent data bytes: 0 bytes
mss requested: 1460 bytes mss requested: 1460 bytes
max segm size: 8688 bytes max segm size: 0 bytes
min segm size: 8 bytes min segm size: 0 bytes
avg segm size: 1687 bytes avg segm size: 0 bytes
max win adv: 5888 bytes max win adv: 968704 bytes
min win adv: 5888 bytes min win adv: 8704 bytes
zero win adv: 0 times zero win adv: 0 times
avg win adv: 5888 bytes avg win adv: 798088 bytes
initial window: 2896 bytes initial window: 0 bytes
initial window: 2 pkts initial window: 0 pkts
ttl stream length: 1577454360 bytes ttl stream length: 0 bytes
missed data: 397030448 bytes missed data: 0 bytes
truncated data: 1159600134 bytes truncated data: 0 bytes
truncated packets: 699383 pkts truncated packets: 0 pkts
data xmit time: 10.005 secs data xmit time: 0.000 secs
idletime max: 7.5 ms idletime max: 7.4 ms
throughput: 117979555 Bps throughput: 0 Bps
This was taken at the receiving 10G NIC.
rick jones