* Re: TEQL for bonding Multi Gbit Ethernet in a cluster?
2015-03-13 21:26 TEQL for bonding Multi Gbit Ethernet in a cluster? Wolfgang Rosner
@ 2015-03-14 0:39 ` Jay Vosburgh
2015-03-14 17:44 ` Wolfgang Rosner
2015-03-16 9:10 ` TEQL for bonding Multi Gbit Ethernet in a cluster - WORKS with quirks Wolfgang Rosner
2 siblings, 0 replies; 4+ messages in thread
From: Jay Vosburgh @ 2015-03-14 0:39 UTC
To: lartc
Wolfgang Rosner <wrosner@tirnet.de> wrote:
>Hello,
>
>
>Can I use TEQL to aggregate multiple Gbit ethernets in a multiple Switch
>Topology across multiple hosts?
>In my example, 17 hosts each having 6 GBit ethernet cards?
>
>Did anybody try and maybe even document such an approach?
>
>I tried layer 2 bonding as described here
>http://www.linuxfoundation.org/collaborate/workgroups/networking/bonding#Maximum_Throughput_in_a_Multiple_Switch_Topology
>but have to struggle with disappointing performance gains, a misbehaving
>switch layer and problems during PXE-DHCP-Boot.
That text in the bonding documentation is fairly old, and
describes a configuration that is not common today. It worked at the
time because the three switches did not communicate, and the hardware of
the era delivered one packet per receive interrupt (think 10 Mb/sec).
The round-robin delivery of packets across interfaces generally stayed
in sync, as there was no packet coalescing on the receive side (no NAPI
in the kernel, either). The switches could be cheap unmanaged switches,
as there were no channel groups on any particular switch, and no sharing
of MAC tables between them.
It doesn't work well today, if for no other reason than
interrupt coalescing and NAPI on the receiver will induce serious out of
order delivery, and turning that off is not really an option.
>Googling for a more controllable, all-linux, maybe layer 3 alternative, I
>encountered LARTC.
>I think multirouting as in chapter 4.2.2 does not solve my problem, as I want
>to share bandwidth for single large transfers, too.
>
>I'd like to try the TEQL approach of chapter 10, but there are some open
>questions:
>
>- What does the routing look like if I have 17 hosts connected by 6 interfaces
>each?
>
>I think I cannot use the /31 net approach on a 1-to-1 basis, since I have 17
>machines on each subnet.
>can I use /27 nets instead, allowing 30 hosts per subnet?
>
>Or do I need a /31 subnet for each pair of machines, on each switch device,
>which would be a total of (17 x 16 / 2) * 6 = 816 /31 subnets?
>
>Is this idea correct:
>- one IP address for teql0 and 6 x 1 IP for eth0 ... eth5 on each host
> equals 7 x 17 = 119 IP addresses in total
>- a route for each target on any physical interface on any host, pointing to
>the counterpart on the same subnet like
>
>route add -host <teql-IP-on-target> gw <matching-dev-IP-on-target>
>
>This still adds up to 16 peers x 6 Interfaces = 96 routes on each host.
>How does this affect performance?
>Of course I can script this, but is there a more "elegant" way?
>Like calculated / OR-ed filter addresses?
>
>- can I continue to use the physical links directly, particularly for
>PXE-booting?
>
>- can I keep the switch configuration as one large network and let ARP / layer
>3 sort out the details, or is it necessary/advantageous to configure all
>layer 3 subnets as separate layer 2 VLANs as well?
>Or do I even need 816 VLANs for 816 /31 subnets on a peer-to-peer basis?
>
>- the clients run diskless on nfsroot, which is established by the dracut boot
>process.
>So either I have to establish the whole teql within dracut during boot, or I
>have to reconfigure the network after boot, without dropping the running
>nfsroot. Is this possible?
>
>- I only find reports and advice for 2.X kernels in the list archives.
>Are there any advances on the TCP tuning issues in recent kernels?
>
>- can I expect a performance gain at all, or will the additional CPU overhead
>outweigh the gain in bandwidth?
>
>- what are the recommended tools for testing and tuning?
>
>========================>
>what I have done so far:
>
>I'm just going to build a "poor man's beowulf cluster" from a bunch of used
>server parts, sourced on ebay.
>
>So I end up with an HP blade center with 16 blade servers in it, each equipped
>with 6 x 1 GBit ethernet ports.
>They are linked by HP Virtual Connect ("VC") switch units, in the way that
>there are 6 VC, each with one port to every one of the blade servers.
>This mapping is hardwired by the blade center design.
>The VC is administered and advertised like one large manageable switch, but
>with caveats, see below.
>
>The whole thing is connected to the outside world via a consumer grade PC
>acting as a gateway and file server with 2x4=8 Gbit ethernet for the cluster
>side.
>
>All boxes run on debian wheezy, with 3.19.0 vanilla on the gateway and
>debian 3.16.7-ckt4-3~bpo70+1 at the blades.
>Blades are booted over DHCP/PXE/TFTP/nfsroot
>
>Of course I would like to utilize the full available network bandwidth for
>interprocess communication.
>
>My first try was linux bonding with 802.3ad bonding policy.
>see
>http://www.linuxfoundation.org/collaborate/workgroups/networking/bonding#Bonding_Driver_Options
>
>However, all traffic goes over one Interface only.
>Maximum throughput is ~ 900 MBit / s.
>
>Googling the issue, I learned that VC does not support LACP bonding across
>different VC modules, so they are only "little-bit-stackable-switches".
>
>Next try was bonding with balance-rr as given here:
>http://www.linuxfoundation.org/collaborate/workgroups/networking/bonding#Maximum_Throughput_in_a_Multiple_Switch_Topology
>
>To get the whole symmetry described there,
>I connected the external gateway with 6 ethernet ports to each of the
>VC-modules on a 1:1 basis. However, this breaks PXE booting, since the PXE
>machine does not appear to support bonding, so even the first DHCP breaks.
>
>Current best setting is now having the blades on balancing-rr and the gateway
>connected by 8 parallel Gbit-links to one single VC-device and using LACP /
>802.3ad on this.
If you're testing your single stream throughput through this
LACP aggregation, you'll be limited by the throughput of one member of
that aggregation, as LACP will not stripe traffic.
>However, performance is still far below expectations:
>~ 2.5 GBit between two blades, using nfs copy of 3 GBit files located in
>ramdisk
>~ 0.9 GBit between server and blade via nfs copy
>~ 2.8 GBit running dbench -D /home 50 parallel on 16 clients
>
>I partially understand the last 2 figures as limitations of the 802.3ad LACP
>protocol.
Most link aggregation systems will keep packets for a given
conversation (connection) on just one aggregation member, specifically
to prevent reordering of packets. On linux, the bonding balance-rr mode
is the exception; the other modes use some type of hash or assignment to
determine the interface to transmit on, and won't stripe across multiple
interfaces.
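
The difference between a hash-based transmit policy and round-robin can be illustrated with a small sketch. It assumes a simplified version of the bonding driver's layer2 policy (XOR of the last byte of each MAC address, modulo the slave count); the real policy also mixes in the packet type, and the destination MAC below is made up for illustration.

```python
from itertools import cycle

def layer2_hash_slave(src_mac, dst_mac, n_slaves):
    """Simplified layer2 hash: XOR the last MAC bytes, modulo slave count."""
    src_last = int(src_mac.split(":")[-1], 16)
    dst_last = int(dst_mac.split(":")[-1], 16)
    return (src_last ^ dst_last) % n_slaves

def round_robin_slaves(n_packets, n_slaves):
    """balance-rr style: each packet goes to the next slave in turn."""
    rr = cycle(range(n_slaves))
    return [next(rr) for _ in range(n_packets)]

SRC = "00:22:64:06:9b:7a"   # eth0 MAC of blade-001, taken from the thread
DST = "00:22:64:06:9c:10"   # hypothetical peer MAC, not from the thread

# Twelve packets of one conversation under each policy.
hashed = [layer2_hash_slave(SRC, DST, 6) for _ in range(12)]
striped = round_robin_slaves(12, 6)

print("hash policy :", hashed)    # always the same slave index
print("round robin :", striped)   # cycles through all six slaves
```

With a hash policy every packet of one conversation lands on the same slave, so a single stream cannot exceed one link's rate; only round-robin spreads a single stream over all links, at the cost of possible reordering.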
Another issue is that, even if you round-robin from the host's
bond, if traffic has to transit through a switch aggregation (channel
group), it will rebalance the traffic on egress, and most likely funnel
it all back through a single switch port.
>I can see unequal load distribution in ifconfig stats.
>I can watch periodical ups and downs during the 5 min dbench run, so I suspect
>some kind of a TCP congestion issue.
>
>I still do not understand the limitations of the direct blade-to-blade
>transfer using the round-robin-policy. According to ifconfig, both incoming
>and outgoing traffic is equally distributed over all physical links.
>I'm afraid this has anything to do with TCP reordering / slow-down /
>congestion window.
Depending on your kernel, etc, you may be able to inspect some
reordering detection counters; netstat -s may report them, e.g.,
% netstat -s|grep -i reord
Detected reordering 20 times using time stamp
or you can hunt for the raw values in /proc/net/netstat or use
nstat to print them:
TcpExt: TCPFACKReorder 0
TcpExt: TCPSACKReorder 0
TcpExt: TCPRenoReorder 0
TcpExt: TCPTSReorder 20
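
A minimal sketch of pulling these counters out of /proc/net/netstat programmatically. It assumes the usual layout of that file (a header line of names followed by a line of values, both starting with the same prefix); the sample text is made up from the values shown above, and the function names are hypothetical.

```python
def parse_proc_netstat(text):
    """Turn header/value line pairs into a {prefix+name: int} dictionary."""
    counters = {}
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    # Lines come in pairs: names first, then the matching values.
    for header, values in zip(lines[0::2], lines[1::2]):
        prefix = header[0].rstrip(":")
        for name, value in zip(header[1:], values[1:]):
            counters[prefix + name] = int(value)
    return counters

def reorder_counters(counters):
    """Keep only the reordering detection counters."""
    return {k: v for k, v in counters.items() if "Reorder" in k}

# Sample input built from the figures quoted above; on a live system
# the text would come from open("/proc/net/netstat").read().
SAMPLE = (
    "TcpExt: TCPFACKReorder TCPSACKReorder TCPRenoReorder TCPTSReorder\n"
    "TcpExt: 0 0 0 20\n"
)

result = reorder_counters(parse_proc_netstat(SAMPLE))
for name, value in sorted(result.items()):
    print(name, value)
```

Taking one snapshot before and one after a test transfer and subtracting them gives the reordering events caused by that transfer alone.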
You may also be able to tweak some interface parameters and
improve things; I'll point you at this discussion from a few years ago:
http://lists.openwall.net/netdev/2011/08/25/88
I haven't tried what's described in the email in the
one-switch-per-interface sort of arrangement that blade environments
impose, and never really got bonding to work well for load balancing in
those type of environments.
One issue for production use was that if a switch port fails on
one of the switches, the other peers sending traffic into that switch
will lose any packets sent to the failed port because their local link
is up, even though a particular peer isn't reachable. That brings up
various cascade failover sorts of problems, or just interconnecting all
of the switches, which then gets confused by the bond's traffic wherein
the source MAC is the same for all interfaces.
-J
---
-Jay Vosburgh, jay.vosburgh@canonical.com
* Re: TEQL for bonding Multi Gbit Ethernet in a cluster?
From: Wolfgang Rosner @ 2015-03-14 17:44 UTC
To: lartc
Hello, Jay,
thanks for your prompt answer.
> You may also be able to tweak some interface parameters and
> improve things; I'll point you at this discussion from a few years ago:
>
> http://lists.openwall.net/netdev/2011/08/25/88
OK.
I tried to tweak the rx-usecs as given there, but saw no reproducible
difference. My system's default was 18, and I tried both 6 and 45.
Regarding the TSO et al. issue, I think this has already become the default
setting on recent systems:
root@blade-001:~# ethtool -k eth0 | grep offload
tcp-segmentation-offload: on
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: on
l2-fwd-offload: off [fixed]
However, following the hints in this article, I found the most obvious
way to tweak throughput:
jumbo frames.
Setting mtu = 9000 on both sides, I get
5200 MBit for netperf throughput, which is 86 % of theoretical maximum.
(was 4100 with mtu=1500 before)
nfs transfer is at 3.4 GBit/s (was 2.7 GBit with mtu=1500)
I had one encounter with 4.2 GBit, but cannot reproduce this.
nfs options for the cross-mounted /run/shm ramdisks are shown by mount as
192.168.130.2:/shm on /cluster/shm/node002 type nfs4
(rw,noatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,soft,proto=tcp,port=0,timeo=600,retrans=1,sec=sys,clientaddr=192.168.130.3,local_lock=none,addr=192.168.130.2)
What I have configured in the automounter script:
$nfs_opts
= "-fstype=nfs4,sec=sys,async,noatime,fg,soft,intr,retrans=1,retry=0" ;
So I haven't configured the rsize/wsize.
As RTFM says, client and server agree on the highest possible values, and try
to get to 1 MByte.
Anyway, I'm getting off topic, as this is not an nfs mailing list.
> % netstat -s|grep -i reord
> Detected reordering 20 times using time stamp
>
> or you can hunt for the raw values in /proc/net/netstat or use
> nstat to print them:
Hm. I see figures, but how to put meaning onto them?
before:
root@blade-002:~# netstat -s|grep -i reord
Detected reordering 1 times using FACK
root@blade-003:~# netstat -s|grep -i reord
Detected reordering 1 times using FACK
Detected reordering 1 times using SACK
now doing some work:
copying a 4 GB file over nfs between ram disks:
(from blade-003 to blade-002)
root@blade-002:~# time cp /cluster/shm/node003/random.002 /run/shm/random.002
real 0m8.701s
user 0m0.000s
sys 0m4.816s
after:
root@blade-002:~# netstat -s|grep -i reord
Detected reordering 1 times using FACK
root@blade-003:~# netstat -s|grep -i reord
Detected reordering 2 times using FACK
Detected reordering 234 times using SACK
Detected reordering 7 times using time stamp
Wouldn't I expect the reordering problems on the receiver's side?
But I see them on the sender. I double and triple checked this....
Just in case you have an eye for peculiarities I do not see:
sender side
root@blade-003:~# nstat
#kernel
IpInReceives 721022 0.0
IpInDelivers 721022 0.0
IpOutRequests 550631 0.0
TcpActiveOpens 1 0.0
TcpPassiveOpens 1 0.0
TcpInSegs 720990 0.0
TcpOutSegs 2177539 0.0
TcpRetransSegs 4566 0.0
UdpInDatagrams 32 0.0
UdpOutDatagrams 2 0.0
TcpExtDelayedACKs 33 0.0
TcpExtTCPPrequeued 1 0.0
TcpExtTCPHPHits 2066 0.0
TcpExtTCPPureAcks 623665 0.0
TcpExtTCPHPAcks 38636 0.0
TcpExtTCPSackRecovery 423 0.0
TcpExtTCPFACKReorder 1 0.0
TcpExtTCPSACKReorder 233 0.0
TcpExtTCPTSReorder 7 0.0
TcpExtTCPFullUndo 19 0.0
TcpExtTCPPartialUndo 18 0.0
TcpExtTCPDSACKUndo 336 0.0
TcpExtTCPFastRetrans 1642 0.0
TcpExtTCPForwardRetrans 2924 0.0
TcpExtTCPDSACKRecv 3942 0.0
TcpExtTCPDSACKOfoRecv 6 0.0
TcpExtTCPDSACKIgnoredOld 15 0.0
TcpExtTCPDSACKIgnoredNoUndo 177 0.0
TcpExtTCPSackShifted 62709 0.0
TcpExtTCPSackMerged 261712 0.0
TcpExtTCPSackShiftFallback 404717 0.0
TcpExtTCPRetransFail 43 0.0
TcpExtTCPRcvCoalesce 536 0.0
TcpExtTCPOFOQueue 3 0.0
TcpExtTCPSpuriousRtxHostQueues 605 0.0
TcpExtTCPAutoCorking 58763 0.0
TcpExtTCPOrigDataSent 2176191 0.0
IpExtInBcastPkts 30 0.0
IpExtInOctets 53708303 0.0
IpExtOutOctets 3181655278 0.0
IpExtInBcastOctets 2280 0.0
IpExtInNoECTPkts 721719 0.0
receiver side:
root@blade-002:~# nstat
#kernel
IpInReceives 750213 0.0
IpInAddrErrors 2 0.0
IpInDelivers 750211 0.0
IpOutRequests 751510 0.0
IcmpInErrors 246 0.0
IcmpInCsumErrors 112 0.0
IcmpInTimeExcds 224 0.0
IcmpInEchoReps 3 0.0
IcmpInTimestamps 19 0.0
IcmpOutErrors 246 0.0
IcmpOutTimeExcds 224 0.0
IcmpOutEchoReps 19 0.0
IcmpOutTimestamps 3 0.0
IcmpMsgInType0 19 0.0
IcmpMsgInType3 224 0.0
IcmpMsgInType8 3 0.0
IcmpMsgOutType0 3 0.0
IcmpMsgOutType3 224 0.0
IcmpMsgOutType8 19 0.0
TcpActiveOpens 118 0.0
TcpPassiveOpens 10 0.0
TcpAttemptFails 112 0.0
TcpInSegs 748966 0.0
TcpOutSegs 751036 0.0
TcpRetransSegs 129 0.0
TcpOutRsts 2 0.0
UdpInDatagrams 871 0.0
UdpOutDatagrams 289 0.0
Ip6OutRequests 10 0.0
Ip6OutMcastPkts 16 0.0
Ip6OutOctets 688 0.0
Ip6OutMcastOctets 1144 0.0
Icmp6OutMsgs 10 0.0
Icmp6OutRouterSolicits 3 0.0
Icmp6OutNeighborSolicits 1 0.0
Icmp6OutMLDv2Reports 6 0.0
Icmp6OutType133 3 0.0
Icmp6OutType135 1 0.0
Icmp6OutType143 6 0.0
TcpExtPruneCalled 3 0.0
TcpExtTW 3 0.0
TcpExtDelayedACKs 372 0.0
TcpExtDelayedACKLocked 2 0.0
TcpExtDelayedACKLost 3926 0.0
TcpExtTCPPrequeued 2 0.0
TcpExtTCPHPHits 42851 0.0
TcpExtTCPPureAcks 1056 0.0
TcpExtTCPHPAcks 10889 0.0
TcpExtTCPSackRecovery 4 0.0
TcpExtTCPFACKReorder 1 0.0
TcpExtTCPDSACKUndo 2 0.0
TcpExtTCPFastRetrans 11 0.0
TcpExtTCPForwardRetrans 3 0.0
TcpExtTCPTimeouts 113 0.0
TcpExtTCPLossProbes 2 0.0
TcpExtTCPLossProbeRecovery 1 0.0
TcpExtTCPRcvCollapsed 543 0.0
TcpExtTCPDSACKOldSent 3951 0.0
TcpExtTCPDSACKOfoSent 6 0.0
TcpExtTCPDSACKRecv 13 0.0
TcpExtTCPDSACKIgnoredNoUndo 1 0.0
TcpExtTCPSackShifted 7 0.0
TcpExtTCPSackMerged 23 0.0
TcpExtTCPSackShiftFallback 72 0.0
TcpExtTCPBacklogDrop 204 0.0
TcpExtTCPRcvCoalesce 44267 0.0
TcpExtTCPOFOQueue 487750 0.0
TcpExtTCPOFOMerge 6 0.0
TcpExtTCPSpuriousRtxHostQueues 112 0.0
TcpExtTCPAutoCorking 1424 0.0
TcpExtTCPWantZeroWindowAdv 45 0.0
TcpExtTCPSynRetrans 112 0.0
TcpExtTCPOrigDataSent 16466 0.0
IpExtInBcastPkts 710 0.0
IpExtInOctets 3252256888 0.0
IpExtOutOctets 57783538 0.0
IpExtInBcastOctets 75470 0.0
IpExtInNoECTPkts 2236348 0.0
Anyway, I could live with these figures, which I get between interfaces
configured with balance-rr bonding.
However, when I switch over to the gateway, which is connected by a 802.3ad
bonding policy link, performance sucks:
from rr to 802.3ad
root@cruncher:/cluster/etc/scripts/available# time
cp /cluster/shm/node003/random.002 /run/shm/
real 1m20.708s
user 0m0.000s
sys 0m5.812s
=> 37 MByte / s = 300 MBit/s
from 802.3ad to rr
root@blade-002:~# time cp /cluster/shm/node000/random.002 /run/shm/random.002
real 0m26.747s
user 0m0.008s
sys 0m4.256s
=> 111 MByte / s = 888 MBit/s
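
The unit conversion behind these two figures is a factor of 8 (bits per byte), which puts both results in the range of hundreds of MBit/s, that is, at or below the rate of a single 1 GBit link. A minimal sketch of the arithmetic, with a hypothetical helper name:

```python
def mbyte_per_s_to_mbit_per_s(mbyte_per_s):
    """Convert a transfer rate from MByte/s to MBit/s (8 bits per byte)."""
    return mbyte_per_s * 8

# The two measured rates from the cp runs above.
for label, rate in (("rr -> 802.3ad", 37), ("802.3ad -> rr", 111)):
    mbit = mbyte_per_s_to_mbit_per_s(rate)
    print(f"{label}: {rate} MByte/s = {mbit} MBit/s")
```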
> >I tried layer 2 bonding as described here
(... searching for a )
>> all-linux, maybe layer 3 alternative,
So maybe I'd leave the rr in place for peer-to-peer connections between the
blades and just have a layer-3-teql -like thing to the gateway?
hm. but can this work?
balance-rr bonding syncs the MAC address across all bond slaves. So I'm afraid
there is no longer a chance to mix it with assigning individual IPs to the
slave interfaces, right?
But when all interfaces have the same MAC, distribution is left to the switch,
with all the opacity problems I encountered.
So either I go all-layer-2 or all-layer-3, right?
> That text in the bonding documentation is fairly old, and
(...)
> It doesn't work well today, if for no other reason than
> interrupt coalescing and NAPI on the receiver will induce serious out of
> order delivery, and turning that off is not really an option.
Well, as my figures above tell me, it's not that bad, as long as it can be
configured undisturbed on both sides and matches the switch topology.
> >- How does the routing look like if I have 17 hosts connected by 6
> > interfaces each?
As long as this question is not worked out, I have no chance to test teql on
my system, I'm afraid.
> >Current best setting is now having the blades on balancing-rr and the
> > gateway connected by 8 parallel Gbit-links to one single VC-device and
> > using LACP / 802.3ad on this.
>
> If you're testing your single stream throughput through this
> LACP aggregation, you'll be limited by the throughput of one member of
> that aggregation, as LACP will not stripe traffic.
I know.
That's the reason why I would like to do round robin.
What can I expect from teql as compared to rr-bonding and to LACP-bonding?
> Another issue is that, even if you round-robin from the host's
> bond, if traffic has to transit through a switch aggregation (channel
> group), it will rebalance the traffic on egress, and most likely funnel
> it all back through a single switch port.
That's obviously what happens in
blade <-> gateway connections
due to the asymmetric connection.
In blade <-> blade peering, it works fine, as I wrote.
I'll try to draw an ASCII image of the topology:
+-+-+-+-+-+----blade-001
+-+-+-+-+-+----blade-002
+-+-+-+-+-+----blade-003
+-+-+-+-+-+----blade-004
+-+-+-+-+-+----blade-005
+-+-+-+-+-+----blade-006
+-+-+-+-+-+----blade-007
+-+-+-+-+-+----blade-008
+-+-+-+-+-+----blade-009
+-+-+-+-+-+----blade-010
+-+-+-+-+-+----blade-011
+-+-+-+-+-+----blade-012
+-+-+-+-+-+----blade-013
+-+-+-+-+-+----blade-014
+-+-+-+-+-+----blade-015
+-+-+-+-+-+----blade-016
+-------------eth2---gateway(aka cruncher)
+-------------eth3---gateway(aka cruncher)
+-------------eth4---gateway(aka cruncher)
+-------------eth5---gateway(aka cruncher)
+-------------eth6---gateway(aka cruncher)
+-------------eth7---gateway(aka cruncher)
+-------------eth8---gateway(aka cruncher)
+-------------eth9---gateway(aka cruncher)
Each + column is a VC-switching module
Blades have eth0 ... eth5 connected in the hardwired matrix shown above.
There are additional stacking links between the VC-switching modules not shown
here.
But it looks like the shortest path algorithm keeps rr neatly ordered between
blades.
But when I distribute the gateway connections equally to all switch modules,
only one of them is "link-active", the others are shown as "link-failover".
Only when I connect all of them to a single VC and configure them using LACP
are they used in parallel. But that does not match the round robin mode, just
as you mention.
> one-switch-per-interface sort of arrangement that blade environments
> impose, and never really got bonding to work well for load balancing in
> those type of environments.
>
> One issue for production use was that if a switch port fails on
> one of the switches, the other peers sending traffic into that switch
Well, I think there are different goals in High-PERFORMANCE clustering as
opposed to High-AVAILABILITY clustering.
Most "production use" refers to web server or enterprise system stuff,
which is basically HA, I'd say. And that's what those boxes are optimised
for - see the link-failover issue above.
Setting up a new HPC cluster with a bunch of dollar notes, I presumably would
simply go for infiniband instead of ethernet (or at least 10 GBit ethernet),
but there's no budget way for that. I simply try to get the best out of the
stuff that I can pick up at the lower end of the food chain ;-)
HM, so what??
I'll try to read the HP docu stuff whether I can get rid of the failover
behaviour. If I only could just rip off all the stacking links and let the VC
modules each behave as a "good old cheap and silly" switch....
Wolfgang Rosner
* Re: TEQL for bonding Multi Gbit Ethernet in a cluster - WORKS with quirks
From: Wolfgang Rosner @ 2015-03-16 9:10 UTC
To: lartc
Hello,
the good news in short: IT WORKS
I get 5.58 GBit / sec over 6 x 1 GBit between my blade nodes,
using layer 3 teql link aggregation:
root@blade-002:~# iperf -c 192.168.130.225
........
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 6.49 GBytes 5.58 Gbits/sec
The /27 net approach worked fine and was straightforward.
It's a simple extension of the /31 approach described here
http://lartc.org/howto/lartc.loadshare.html
It needs just the default routes that come up when configuring the IP addresses.
I divided a /24 net into 8 chunks
- one for the boot configuration (PXE, nfsroot...)
- 6 for the parallel link subnets, one per link
- one for the teql subnet
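
The address plan can be reproduced with a short sketch. It assumes the chunk order visible in the ip addr output in this message (boot subnet first, then the six link subnets, teql subnet last) and that each host uses the same offset in every chunk; the role names are hypothetical labels.

```python
import ipaddress

BASE = ipaddress.ip_network("192.168.130.0/24")
# Eight /27 chunks, each with 30 usable host addresses.
CHUNKS = list(BASE.subnets(new_prefix=27))
# Assumed role order: boot, six parallel links, teql.
ROLES = ["boot"] + [f"link{i}" for i in range(6)] + ["teql"]

def host_addresses(host_index):
    """Address of one host in each chunk; index 1 corresponds to blade-001."""
    if not 1 <= host_index <= 30:
        raise ValueError("a /27 has only 30 usable host addresses")
    return {role: str(net.network_address + host_index)
            for role, net in zip(ROLES, CHUNKS)}

for role, addr in host_addresses(1).items():
    print(f"{role:6s} {addr}/27")
```

With 17 machines per subnet, a /27 leaves 13 spare addresses in each chunk, and no per-host routes are needed because every peer is on-link in each subnet.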
root@blade-001:~# ip addr
....
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc teql0 state UP qlen
1000
link/ether 00:22:64:06:9b:7a brd ff:ff:ff:ff:ff:ff
inet 192.168.130.1/27 brd 192.168.130.31 scope global eth0
inet 192.168.130.33/27 scope global eth0:0
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc teql0 state UP qlen
1000
link/ether 00:22:64:06:db:4c brd ff:ff:ff:ff:ff:ff
inet 192.168.130.65/27 scope global eth1
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc teql0 state UP qlen
1000
link/ether 00:21:5a:af:8e:40 brd ff:ff:ff:ff:ff:ff
inet 192.168.130.97/27 scope global eth2
....
7: eth5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc teql0 state UP qlen
1000
link/ether 00:21:5a:af:8e:43 brd ff:ff:ff:ff:ff:ff
inet 192.168.130.193/27 scope global eth5
8: teql0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 9000 qdisc pfifo_fast state
UNKNOWN qlen 100
link/void
inet 192.168.130.225/27 scope global teql0
valid_lft forever preferred_lft forever
(boring lines deleted)
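
A sketch of the commands that would produce a teql0 device like the one listed above, written as a dry-run generator that only builds command strings and executes nothing. The command sequence follows the LARTC load sharing howto; the function name, the use of sysctl for rp_filter, and the omission of the per-interface address setup are assumptions, not the author's actual script.

```python
def teql_commands(slaves, teql_addr, mtu=9000):
    """Return shell commands (as strings) for one host; nothing is executed."""
    cmds = ["modprobe sch_teql"]
    for dev in slaves:
        cmds.append(f"ip link set dev {dev} mtu {mtu}")
        # Attaching the teql0 qdisc enslaves the device to teql0.
        cmds.append(f"tc qdisc add dev {dev} root teql0")
        # The LARTC howto disables reverse-path filtering on the slaves.
        cmds.append(f"sysctl -w net.ipv4.conf.{dev}.rp_filter=0")
    cmds.append(f"ip link set dev teql0 mtu {mtu} up")
    cmds.append(f"ip addr add {teql_addr} dev teql0")
    return cmds

cmds = teql_commands([f"eth{i}" for i in range(6)], "192.168.130.225/27")
print("\n".join(cmds))
```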
Jumbo frames (mtu = 9000) are essential; they increase throughput from ~ 3 GBit
(aka 50 % of theoretical maximum) to > 5.5 GBit (aka > 90 %).
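
One plausible reading of the jumbo frame gain, offered as an assumption rather than a measured cause: larger frames cut the number of packets per second the hosts must handle. The sketch below counts full-size frames at the 6 GBit/s aggregate line rate, using the standard Ethernet per-frame overhead of 38 bytes (14 header, 4 FCS, 8 preamble, 12 inter-frame gap).

```python
# Per-frame bytes on the wire beyond the MTU-sized payload.
ETH_OVERHEAD = 14 + 4 + 8 + 12

def frames_per_second(rate_bit_s, mtu):
    """Full-size frames per second needed to fill the given line rate."""
    return rate_bit_s / (8 * (mtu + ETH_OVERHEAD))

AGGREGATE = 6e9  # six 1 GBit links
fps_1500 = frames_per_second(AGGREGATE, 1500)
fps_9000 = frames_per_second(AGGREGATE, 9000)

print(f"mtu 1500: {fps_1500:,.0f} frames/s")
print(f"mtu 9000: {fps_9000:,.0f} frames/s")
print(f"ratio   : {fps_1500 / fps_9000:.2f}")
```

Fewer frames per second means fewer interrupts and fewer chances for packets from different links to interleave out of order.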
So far so good:
I can combine the performance of layer 2 aggregation (bonding) with layer 3
control of what's going on, getting a grip on nasty switch behaviour.
At least, so I hoped.
== QUIRKS ==
But when it gets to transfer between the blade nodes and the external gateway,
things get funny again.
This is what the network looks like now:
The gateway aka cruncher is connected by one Gbit cable to each of the six
VC switches in the blade enclosure. For each VC bay (matching the
physical /27 subnets) I configured a separate vlan to convince VC to treat
the uplinks as parallel, not as failover.
+-------------eth4---gateway(aka cruncher)
| +-------------eth5---gateway(aka cruncher)
| | +-------------eth6---gateway(aka cruncher)
| | | +-------------eth7---gateway(aka cruncher)
| | | | +-------------eth8---gateway(aka cruncher)
| | | | | +-------------eth9---gateway(aka cruncher)
+-+-+-+-+-+----blade-001
+-+-+-+-+-+----blade-002
+-+-+-+-+-+----blade-003
A straight implementation of the above scheme on the gateway yields no more
than ~ 2 GBit.
So, some aggregation happens, but far from the 6 GBit maximum.
ifconfig and wireshark show traffic coming equally over all 6 lines,
but with an awful lot of retransmits.
Well, maybe wireshark gets confused by teql and fails to match packets
since they go over different interfaces, but that's another issue, not the
primary one here.
After lots of googling, I pinned the symptom down to this issue:
# for i in `seq 2 9`; do ethtool -S eth$i | grep rx_missed_errors ; done
rx_missed_errors: 0
rx_missed_errors: 0
rx_missed_errors: 0
rx_missed_errors: 0
rx_missed_errors: 29159
rx_missed_errors: 28619
rx_missed_errors: 9263
rx_missed_errors: 23306
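
The shell loop above prints the counters without interface names. A small sketch that keeps the association, assuming the `name: value` line format of ethtool -S output; the sample text is made up from the eight values above (eth2 to eth9 in order), and the extra rx_packets line is purely illustrative.

```python
def missed_errors(ethtool_output):
    """Extract the rx_missed_errors value from 'ethtool -S' style text."""
    for line in ethtool_output.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() == "rx_missed_errors":
            return int(value)
    return 0

# Sample data: the eight counter values shown above, eth2 ... eth9.
# On a live system the text would come from running 'ethtool -S <dev>'.
VALUES = [0, 0, 0, 0, 29159, 28619, 9263, 23306]
SAMPLE = {
    f"eth{i}": f"NIC statistics:\n     rx_packets: 1\n     rx_missed_errors: {n}\n"
    for i, n in zip(range(2, 10), VALUES)
}

# Keep only the interfaces that actually dropped frames.
flagged = {dev: missed_errors(text) for dev, text in SAMPLE.items()
           if missed_errors(text) > 0}
for dev, count in flagged.items():
    print(dev, count)
```

Read this way, all drops fall on eth6 to eth9, the second quad-port adapter.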
from
http://osdir.com/ml/linux.drivers.e1000.devel/2007-11/msg00133.html
---<quote>--------------------
you are running out of bus bandwidth (which is why increasing
descriptors doesn't help). rx_missed_errors occur when you run out of
fifo on the adapter itself, indicating the bus can't be attained for
long enough to keep the data rate up.
---</quote>--------------------
eth2 .. eth5 and eth6 ... eth9 are a quad port 82571EB Gigabit Ethernet each.
extracted from lspci I find
0c:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet
Controller (rev 06)
Subsystem: Intel Corporation PRO/1000 PT Quad Port Server Adapter
07:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet
Controller (Copper) (rev 06)
Subsystem: Hewlett-Packard Company NC364T PCI Express Quad Port
Gigabit Server Adapter
' +-0a.0-[05-08]----00.0-[06-08]--+-02.0-[07]--+-00.0
' | | \-00.1
' | \-04.0-[08]--+-00.0
' | \-00.1
' +-0b.0-[09]--+-00.0 | \-00.1
' +-0d.0-[0a-0d]----00.0-[0b-0d]--+-00.0-[0c]--+-00.0
' | | \-00.1
' | \-01.0-[0d]--+-00.0
' | \-00.1
So both adapters have the same chipset, same driver, similar bus connectivity
and announce identical PCI bus bandwidth:
'LnkSta: Speed 2.5GT/s, Width x4'
According to http://en.wikipedia.org/wiki/PCI_Express
this comes to 8 Gbit/s, which should basically suffice, I think.
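
The 8 Gbit/s figure follows from the link status line: 2.5 GT/s per lane, four lanes, and the 8b/10b line coding of first-generation PCI Express, which carries 8 data bits in every 10 transferred bits. A minimal sketch of that arithmetic (per direction; packet headers and flow control on the bus reduce the usable rate further):

```python
def pcie_gen1_gbit(lanes, gt_per_s=2.5):
    """Data rate per direction in Gbit/s after 8b/10b line coding."""
    return gt_per_s * lanes * 8 / 10

link_rate = pcie_gen1_gbit(4)   # 'LnkSta: Speed 2.5GT/s, Width x4'
nic_demand = 4 * 1.0            # four 1 GBit ports receiving at line rate

print(f"link: {link_rate} Gbit/s, quad-port demand: {nic_demand} Gbit/s")
```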
And on the "good" NIC, it actually does, obviously:
To check, and to increase the safety headroom, I switched 2 cables from the
"buggy" NIC to the "healthy" one, keeping the link config matching, of course.
And, lo and behold, we get up from ~2 GBit to > 3 GBit.
There are still thousands of rx_missed_errors
on the "bad" NIC, which only has to serve 2 GBit connections now, and
still zero rx_missed_errors on the "good" NIC, which carries 4 active GBit
links now.
Further googling and tweaking memory limits in
/proc/sys/net/ipv4/tcp_*mem
and
/proc/sys/net/core/*mem*
showed no difference.
What helped was to increase the "TCP window size" on the iperf server side
from
"TCP window size: 85.3 KByte (default)"
to a value between 512K and 2 M
root@cruncher:/cluster/etc/network# iperf -s -w1M
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[ 4] local 192.168.130.254 port 5001 connected with 192.168.130.226 port
33775
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-10.0 sec 5.06 GBytes 4.35 Gbits/sec
Now we are over 70 % of theoretical maximum.
However, neither do I really understand it, nor do I know how to transfer
this window size setting to other applications.
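
For applications under one's own control, what iperf -w does can be reproduced by requesting a larger receive buffer on the socket before it listens or connects. A minimal sketch in Python, assuming Linux semantics: the kernel doubles the requested value for bookkeeping (which matches iperf reporting 2.00 MByte for a 1.00 MByte request) and caps it at net.core.rmem_max. The helper name is hypothetical.

```python
import socket

def request_rcvbuf(sock, size_bytes):
    """Ask for a receive buffer size and return what the kernel granted."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size_bytes)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)

REQUESTED = 1 << 20  # 1 MiB, as in 'iperf -s -w1M'

# The option must be set before listen()/connect() to affect the TCP window.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    granted = request_rcvbuf(s, REQUESTED)

print(f"requested {REQUESTED} bytes, kernel reports {granted} bytes")
```

If the granted value is far below the request, net.core.rmem_max is the limiting sysctl.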
I think the TCP window size is just a workaround for underlying problems,
because
- still lots of rx_missed_errors for eth6 and eth7
- the blade-blade connection with 5.6 GBit works even better without any
tweaking with small TCP window size:
root@blade-001:~# iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[ 4] local 192.168.130.225 port 5001 connected with 192.168.130.226 port
49581
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-10.0 sec 6.49 GBytes 5.58 Gbits/sec
Possible causes on my list
- firmware problem (NICs, Mainboard)
- hardware problem (NICs, Mainboard)
- some really weird hidden tweak parameter
- conceptual limitation of hardware design
- driver problem
- kernel / scheduling issue / IRQ / race...whatever?
- still the nasty VC blade switch?
- any more?
The gateway mainboard is a SABERTOOTH 990FX R2.0
[AMD/ATI] RD890 PCI to PCI bridge (external gfx1 port A)
- consumer grade, but quite recent -
Gateway CPU is a AMD FX-8320 8 Core
Linux cruncher 3.19.0 #1 SMP Tue Mar 3 19:05:04 CET 2015 x86_64 GNU/Linux
The blade nodes are HP blades 460c G1
chipset Intel 5000
- enterprise grade, but quite some years now, I suppose -
CPU 2 x Xeon E5430 quad
Linux blade-002.crunchnet 3.16.0-0.bpo.4-amd64 #1 SMP Debian
3.16.7-ckt4-3~bpo70+1 (2015-02-12) x86_64 GNU/Linux
Testing memory bandwidth with mbw (as a first measure of system bus throughput),
the gateway outperforms the blades by a factor of two:
root@blade-002:~# mbw -n1 1000
AVG Method: MEMCPY Elapsed: 0.61679 MiB: 1000.00000 Copy: 1621.300
MiB/s
AVG Method: DUMB Elapsed: 0.51892 MiB: 1000.00000 Copy: 1927.068
MiB/s
AVG Method: MCBLOCK Elapsed: 0.39211 MiB: 1000.00000 Copy: 2550.311
MiB/s
root@cruncher...# mbw -n1 1000
AVG Method: MEMCPY Elapsed: 0.27301 MiB: 1000.00000 Copy: 3662.923
MiB/s
AVG Method: DUMB Elapsed: 0.19693 MiB: 1000.00000 Copy: 5077.972
MiB/s
AVG Method: MCBLOCK Elapsed: 0.19287 MiB: 1000.00000 Copy: 5184.947
MiB/s
So, conceptually, I see no reason why, of two nearly identical quad-port GBit
adapters, one should fail so badly on the faster system.
I compared lspci again line by line and found a tiny difference:
Hewlett-Packard Company NC364T.... (the 'bad')
Region 0: Memory at fc400000 (32-bit, non-prefetchable) [size=128K]
Region 1: Memory at fc300000 (32-bit, non-prefetchable) [size=512K]
Region 2: I/O ports at 8000 [size=32]
Intel Corporation PRO/1000 PT ...('the good')
Region 0: Memory at fc5a0000 (32-bit, non-prefetchable) [size=128K]
Region 1: Memory at fc580000 (32-bit, non-prefetchable) [size=128K]
Region 2: I/O ports at 5020 [size=32]
So the "Region 1" memory is 4x larger in the 'bad' NIC.
Any clue whether this may be related?
Just an uneducated guess:
If it were some kind of pointer fifo into some buffer memory, the larger one
might run out of referred buffer, while the smaller does not????
How to proceed from "Guess" to "Know" to "Cure"?
Anybody any idea?
===========
Just to exclude the idiot's error before hitting the send button:
I switched the cables back to the faulty NIC (so that only two were left on the good one)
and rate on the teql link went down from > 2 Gbit to ~ 340 Kbits/sec
So, yes, cabling was right before,
and yes, the scheme provides some fault tolerance, albeit with severe hits in
performance.
Wolfgang Rosner