TEQL for bonding Multi Gbit Ethernet in a cluster?

* TEQL for bonding Multi Gbit Ethernet in a cluster?
@ 2015-03-13 21:26 Wolfgang Rosner
  2015-03-14  0:39 ` Jay Vosburgh
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Wolfgang Rosner @ 2015-03-13 21:26 UTC
  To: lartc

Hello,

Can I use TEQL to aggregate multiple Gbit Ethernet links in a multiple-switch 
topology across multiple hosts?
In my case, that means 17 hosts with 6 Gbit Ethernet ports each.

Has anybody tried, and maybe even documented, such an approach?

I tried layer 2 bonding as described here
http://www.linuxfoundation.org/collaborate/workgroups/networking/bonding#Maximum_Throughput_in_a_Multiple_Switch_Topology
but I am struggling with disappointing performance gains, a misbehaving 
switch layer and problems during PXE/DHCP boot.

Googling for a more controllable, all-linux, maybe layer 3 alternative, I 
encountered LARTC.
I think multirouting as in chapter 4.2.2 does not solve my problem, as I also 
want to share bandwidth within single large transfers.

I'd like to try the TEQL approach of chapter 10, but there are some open 
questions:

- What does the routing look like if I have 17 hosts connected by 6 interfaces 
each?

I think I cannot use the /31 net approach on a 1-to-1 basis, since I have 17 
machines on each subnet.
Can I use /27 nets instead, allowing 30 hosts per subnet?

Or do I need a /31 subnet for each pair of machines, on each switch device,
which would be a total of (17 x 16 / 2) x 6 = 816 /31 subnets?

Is this idea correct?
- one IP address for teql0 and 6 x 1 IP for eth0 ... eth5 on each host,
	which equals 7 x 17 = 119 IP addresses in total
- a route for each target on every physical interface on every host, pointing to 
the counterpart on the same subnet, like

route add -host <teql-IP-on-target> gw <matching-dev-IP-on-target>

This still adds up to 16 peers x 6 Interfaces = 96 routes on each host. 
How does this affect performance?
Of course I can script this, but is there a more "elegant" way?
Like calculated / OR-ed filter addresses?
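
The counts in the questions above can be checked with a few lines of Python.
This is only a sketch: the function names and the 10.0.0.0 placeholder
network are assumptions made for illustration, and host_routes() merely
formats the route command quoted above without running it.

```python
import ipaddress

HOSTS = 17           # hosts per subnet, as counted in the message
LINKS_PER_HOST = 6   # eth0 ... eth5


def usable_hosts(prefix_len: int) -> int:
    """Usable addresses in an IPv4 subnet (network and broadcast excluded)."""
    # 10.0.0.0 is only a placeholder base; any aligned network gives the same count.
    net = ipaddress.ip_network(f"10.0.0.0/{prefix_len}")
    return net.num_addresses - 2


def p2p_subnets(hosts: int, links: int) -> int:
    """Number of /31 subnets if every host pair gets one per switch."""
    return hosts * (hosts - 1) // 2 * links


def routes_per_host(hosts: int, links: int) -> int:
    """One host route per peer and per physical interface."""
    return (hosts - 1) * links


def addresses_total(hosts: int, links: int) -> int:
    """One address per physical interface plus one for teql0, on every host."""
    return hosts * (links + 1)


def host_routes(teql_ip: str, link_ips: list) -> list:
    """Format the route commands for one peer (placeholder addresses, not executed)."""
    return [f"route add -host {teql_ip} gw {gw}" for gw in link_ips]


print("usable hosts in a /27:", usable_hosts(27))
print("/31 subnets needed:   ", p2p_subnets(HOSTS, LINKS_PER_HOST))
print("routes per host:      ", routes_per_host(HOSTS, LINKS_PER_HOST))
print("addresses in total:   ", addresses_total(HOSTS, LINKS_PER_HOST))
```

A /27 per switch therefore has room for all 17 hosts, while the pairwise /31
layout grows quadratically with the number of hosts.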

- Can I continue to use the physical links directly, particularly for 
PXE booting?

- Can I keep the switch configuration as one large network and let ARP / layer 
3 sort out the details, or is it necessary/advantageous to configure all 
layer 3 subnets as separate layer 2 VLANs as well?
Or do I even need 816 VLANs for 816 /31 subnets on a peer-to-peer basis?

- The clients run diskless on nfsroot, which is established by the dracut boot 
process.
So either I have to establish the whole teql setup within dracut during boot, or I 
have to reconfigure the network after boot, without dropping the running 
nfsroot. Is this possible? 

- I only find reports and advice for 2.x kernels in the list archives.
Have there been any advances on the TCP tuning issues in recent kernels?

- Can I expect a performance gain at all, or will the additional CPU overhead 
outweigh the gain in bandwidth?

- What are the recommended tools for testing and tuning?

========================
What I have done so far:

I'm building a "poor man's Beowulf cluster" from a bunch of used 
server parts, sourced on eBay.

So I ended up with an HP blade center with 16 blade servers in it, each equipped 
with 6 x 1 Gbit Ethernet ports.
They are linked by HP Virtual Connect ("VC") switch units, such that 
there are 6 VCs, each with one port to every one of the blade servers.
This mapping is hardwired by the blade center design.
The VC is administered and advertised like one large manageable switch, but 
with caveats, see below.

The whole thing is connected to the outside world via a consumer grade PC 
acting as a gateway and file server with 2x4=8 Gbit ethernet for the cluster 
side.

All boxes run Debian wheezy, with 3.19.0 vanilla on the gateway and 
Debian 3.16.7-ckt4-3~bpo70+1 on the blades. 
Blades are booted over DHCP/PXE/TFTP/nfsroot.

Of course I would like to utilize the full available network bandwidth for 
interprocess communication.

My first try was linux bonding with 802.3ad bonding policy.
see
http://www.linuxfoundation.org/collaborate/workgroups/networking/bonding#Bonding_Driver_Options

However, all traffic goes over one Interface only.
Maximum throughput is ~ 900 MBit / s.

Googling the issue, I learned that VC does not support LACP bonding across 
different VC modules, so they are only "little-bit-stackable-switches".

Next try was bonding with balance-rr as given here:
http://www.linuxfoundation.org/collaborate/workgroups/networking/bonding#Maximum_Throughput_in_a_Multiple_Switch_Topology

To get the whole symmetry described there, 
I connected the external gateway with 6 Ethernet ports, one to each of the 
VC modules. However, this breaks PXE booting, since the PXE 
machine does not appear to support bonding, so even the first DHCP request fails.

The current best setup has the blades on balance-rr and the gateway 
connected by 8 parallel Gbit links to one single VC device, using LACP / 
802.3ad on this.

However, performance is still far below expectation:
~ 2.5 GBit between two blades, using nfs copy of 3 GBit files located in 
ramdisk
~ 0.9 GBit between server and blade via nfs copy
~ 2.8 GBit running  dbench -D /home 50 parallel on 16 clients 

I partially understand the last 2 figures as limitations of the 802.3ad LACP 
protocol.
I can see unequal load distribution in ifconfig stats.
I can watch periodical ups and downs during the 5 min dbench run, so I suspect 
some kind of a TCP congestion issue.

I still do not understand the limitations of the direct blade-to-blade 
transfer using the round-robin-policy. According to ifconfig, both incoming 
and outgoing traffic is equally distributed over all physical links.
I'm afraid this has something to do with TCP reordering / slow-down / 
congestion window.

Thank you for any pointer.

Wolfgang Rosner

* Re: TEQL for bonding Multi Gbit Ethernet in a cluster?
  2015-03-13 21:26 TEQL for bonding Multi Gbit Ethernet in a cluster? Wolfgang Rosner
@ 2015-03-14  0:39 ` Jay Vosburgh
  2015-03-14 17:44 ` Wolfgang Rosner
  2015-03-16  9:10 ` TEQL for bonding Multi Gbit Ethernet in a cluster - WORKS with quirks Wolfgang Rosner
  2 siblings, 0 replies; 4+ messages in thread
From: Jay Vosburgh @ 2015-03-14  0:39 UTC
  To: lartc

Wolfgang Rosner <wrosner@tirnet.de> wrote:

>Hello,
>
>
>Can I use TEQL to aggreagate multiple Gbit ethernets in a multiple Switch 
>Topology across multiple hosts?
>In my example, 17 hosts each having 6 GBit ethernet cards?
>
>Did anybody try and maybe even document such an approach?
>
>I tried layer 2 bonding as described here
>http://www.linuxfoundation.org/collaborate/workgroups/networking/bonding#Maximum_Throughput_in_a_Multiple_Switch_Topology
>but have to struggle with disappointing performance gains, a misbehaving 
>switch layer and problems during PXE-DHCP-Boot.

	That text in the bonding documentation is fairly old, and
describes a configuration that is not common today.  It worked at the
time because the three switches did not communicate, and the hardware of
the era delivered one packet per receive interrupt (think 10 Mb/sec).
The round-robin delivery of packets across interfaces generally stayed
in sync, as there was no packet coalescing on the receive side (no NAPI
in the kernel, either).  The switches could be cheap unmanaged switches,
as there were no channel groups on any particular switch, and no sharing
of MAC tables between them.

	It doesn't work well today, if for no other reason than
interrupt coalescing and NAPI on the receiver will induce serious out of
order delivery, and turning that off is not really an option.

>Googling for a more controllable, all-linux, maybe layer 3 alternative, I 
>encountered LARTC.
>I think multirouting as in chapter 4.2.2 does not solve my problem, as I want 
>to share bandwith for single large transfers, too.
>
>I'd like to try the TEQL approach of chapter 10, but there are some open 
>questions:
>
>- How does the routing look like if I have 17 hosts connected by 6 interfaces 
>each?
>
>I think I cannot use the /31 net approach on a 1-to-1 basis, since I have 17 
>machines on each subnet.
>can I use /27 nets instead, allowing 30 hosts per subnet?
>
>Or do I need a /31 subnet for each pair of machines, on each switch device,
>which where a total of (17 x 16 /2) * 6 = 816  of /31 subnets?
>
>Is this idea correct
>-  one IP-addess for teql0 and 6 x 1 IP for eth0 ... eth5 on each host
>	equals 7 x 17 = 119 IP addresses in total
>- a route for each target on any physical interface on any host, pointing to 
>the counterpart on the same subnet like
>
>route add -host <teql-IP-on-target> gw <matching-dev-IP-on-target>
>
>This still adds up to 16 peers x 6 Interfaces = 96 routes on each host. 
>How does this affect performance?
>Of course I can script this, but is there a more "elegant" way?
>Like calculated / OR-ed filter addresses?
>
>- can I continue to use the pyhsical links directly, particularly for 
>PXE-booting?
>
>- can I keep the switch configuration as one large network and let ARP/ layer 
>3 sort out the details, or is it necessary/advantageous to configure all 
>layer 3 subnets as seperate layer 2 Vlans as well?
>Or do I even need 816 vlans for 816 /31 subnets on a peer-to-peer-basis?
>
>- the clients run diskless on nfsroot, which is established by the dracut boot 
>process.
>So either I have to establish the whole teql within dracut during boot, or I 
>have to reconfigure the nework after boot, without dropping the running 
>nfsroot. Is this possible? 
>
>- I only find reports and advices for 2.X kernels on the list archives.
>Are there any advances on the TCP tuning issues in recent kernels?
>
>- can I expect a performance gain at all, or will the additional CPU overhead 
>outweight the gain in badwith?
>
>- what are the recommended tools for testing and tuning?
>
>========================
>what I have done so far:
>
>I'm just going to build a "poor man's beowulf cluster" from a bunch of used 
>server parts, sourced on ebay.
>
>So I end up with a HP blade center with 16 blade servers in it, each equipped 
>with 6 x 1 GBit ehternet ports.
>They are linked by HP Virtual connect ("VC") switch units, in the way that 
>there are 6 VC, each with one port to every one of the blade servers.
>This mapping is hardwired by the blade center design.
>The VC ist administered and advertised like one large manageable switch, but 
>with caveats, see below.
>
>The whole thing is connected to the outside world via a consumer grade PC 
>acting as a gateway and file server with 2x4=8 Gbit ethernet for the cluster 
>side.
>
>All boxes run on debian wheezy, with 3.19.0 vanilla on the gateway and 
>debian 3.16.7-ckt4-3~bpo70+1 at the blades. 
>Blades are bootet over DHCP/PXE/TFTP/nfsroot
>
>Of course I would like to utilize the full available network bandwith for 
>interprocess communication.
>
>My first try was linux bonding with 802.3ad bonding policy.
>see
>http://www.linuxfoundation.org/collaborate/workgroups/networking/bonding#Bonding_Driver_Options
>
>However, all traffic goes over one Interface only.
>Maximum throughput is ~ 900 MBit / s.
>
>Googling the issue, I learned that VC does not support LACP bonding across 
>different VC modules, so they are only "little-bit-stackable-switches".
>
>Next try was bonding with balance-rr as given here:
>http://www.linuxfoundation.org/collaborate/workgroups/networking/bonding#Maximum_Throughput_in_a_Multiple_Switch_Topology
>
>To get the whole symmetry described there, 
>I connected the external gateway with 6 ethernet ports to each of the 
>VC-modules on a 1:1 basis. However, this breaks PXE booting, since the PXE 
>machine does not appear to support bonding, so even the first DHCP breaks.
>
>Current best setting is now having the blades on balancing-rr and the gateway 
>connected by 8 parallel Gbit-links to one single VC-device and using LACP / 
>802.3ad on this.

	If you're testing your single stream throughput through this
LACP aggregation, you'll be limited by the throughput of one member of
that aggregation, as LACP will not stripe traffic.

>However, performance is still far beyond expectation:
>~ 2.5 GBit between two blades, using nfs copy of 3 GBit files located in 
>ramdisk
>~ 0.9 GBit between server and blade via nfs copy
>~ 2,8 GBit running  dbench -D /home 50 parallel on 16 clients 
>
>I partially understand the last 2 figures as limitations of the 802.3ad LACP 
>protocol.

	Most link aggregation systems will keep packets for a given
conversation (connection) on just one aggregation member, specifically
to prevent reordering of packets.  On linux, the bonding balance-rr mode
is the exception; the other modes use some type of hash or assignment to
determine the interface to transmit on, and won't stripe across multiple
interfaces.

	Another issue is that, even if you round-robin from the host's
bond, if traffic has to transit through a switch aggregation (channel
group), it will rebalance the traffic on egress, and most likely funnel
it all back through a single switch port.
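
	The difference between hashed distribution and round-robin
striping can be illustrated with a small Python sketch.  This is a
deliberately simplified model: the XOR-of-MACs hash below is an assumption
for illustration, not the exact formula of the bonding driver or of any
particular switch, and the MAC addresses are made up.

```python
from itertools import cycle


def mac_to_int(mac: str) -> int:
    """Convert 'aa:bb:cc:dd:ee:ff' into an integer."""
    return int(mac.replace(":", ""), 16)


def hashed_member(src_mac: str, dst_mac: str, members: int) -> int:
    """Simplified layer-2 style hash: XOR of both MACs modulo the member count.

    Every packet of the same src/dst pair lands on the same member, so a
    single conversation can never exceed the speed of one link.
    """
    return (mac_to_int(src_mac) ^ mac_to_int(dst_mac)) % members


def round_robin_members(members: int, packets: int) -> list:
    """balance-rr style striping: consecutive packets use consecutive members."""
    order = cycle(range(members))
    return [next(order) for _ in range(packets)]


# Hypothetical, locally administered MAC addresses.
SRC = "02:00:00:00:00:01"
DST = "02:00:00:00:00:02"

hashed = [hashed_member(SRC, DST, 6) for _ in range(12)]
striped = round_robin_members(6, 12)

print("hashed :", hashed)    # the same member index twelve times
print("striped:", striped)   # cycles through all six members
```

In this model the hashed flow is pinned to one link, while round-robin uses
all six at the price of possible reordering on the receiver.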

>I can see unequal load distribution in ifconfig stats.
>I can watch periodical ups and downs during the 5 min dbench run, so I suspect 
>some kind of a TCP congestion issue.
>
>I still do not understand the limitations of the direct blade-to-blade 
>transfer using the round-robin-policy. According to ifconfig, both incoming 
>and outgoing traffic is equally distributed over all physical links.
>I'm afraid this has anything to do with TCP reordering / slow-down / 
>congestion window.

	Depending on your kernel, etc, you may be able to inspect some
reordering detection counters; netstat -s may report them, e.g.,

% netstat -s|grep -i reord
    Detected reordering 20 times using time stamp

	or you can hunt for the raw values in /proc/net/netstat or use
nstat to print them:

TcpExt:  TCPFACKReorder                         0
TcpExt:  TCPSACKReorder                         0
TcpExt:  TCPRenoReorder                         0
TcpExt:  TCPTSReorder                          20
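
	For scripted monitoring, the raw values can also be pulled out
of /proc/net/netstat directly.  A minimal Python sketch, assuming the usual
layout of that file (pairs of lines, the first with counter names and the
second with values, both starting with the same prefix); the function names
are made up for illustration:

```python
def parse_netstat(text: str) -> dict:
    """Parse the name/value line pairs of /proc/net/netstat into a dict.

    Keys are the prefix plus the counter name, e.g. 'TcpExtTCPTSReorder'.
    """
    counters = {}
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Lines come in pairs: a header line with names, then a line with values.
    for header, values in zip(lines[0::2], lines[1::2]):
        prefix, names = header.split(":", 1)
        _, numbers = values.split(":", 1)
        for name, number in zip(names.split(), numbers.split()):
            counters[prefix + name] = int(number)
    return counters


def reorder_counters(counters: dict) -> dict:
    """Keep only the reordering detection counters."""
    return {k: v for k, v in counters.items() if "Reorder" in k}


# Shortened sample in the file's format; a real file has many more columns.
SAMPLE = (
    "TcpExt: TCPFACKReorder TCPSACKReorder TCPRenoReorder TCPTSReorder DelayedACKs\n"
    "TcpExt: 0 0 0 20 33\n"
    "IpExt: InOctets OutOctets\n"
    "IpExt: 1000 2000\n"
)

for name, value in reorder_counters(parse_netstat(SAMPLE)).items():
    print(f"{name:28s}{value}")
```

On a live system the same functions can be fed with the contents of
/proc/net/netstat instead of the sample string.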

	You may also be able to tweak some interface parameters and
improve things; I'll point you at this discussion from a few years ago:

http://lists.openwall.net/netdev/2011/08/25/88

	I haven't tried what's described in the email in the
one-switch-per-interface sort of arrangement that blade environments
impose, and never really got bonding to work well for load balancing in
those types of environments.

	One issue for production use was that if a switch port fails on
one of the switches, the other peers sending traffic into that switch
will lose any packets sent to the failed port because their local link
is up, even though a particular peer isn't reachable.  That brings up
various cascade failover sorts of problems, or just interconnecting all
of the switches, which then gets confused by the bond's traffic wherein
the source MAC is the same for all interfaces.

	-J

---
	-Jay Vosburgh, jay.vosburgh@canonical.com

* Re: TEQL for bonding Multi Gbit Ethernet in a cluster?
  2015-03-13 21:26 TEQL for bonding Multi Gbit Ethernet in a cluster? Wolfgang Rosner
  2015-03-14  0:39 ` Jay Vosburgh
@ 2015-03-14 17:44 ` Wolfgang Rosner
  2015-03-16  9:10 ` TEQL for bonding Multi Gbit Ethernet in a cluster - WORKS with quirks Wolfgang Rosner
  2 siblings, 0 replies; 4+ messages in thread
From: Wolfgang Rosner @ 2015-03-14 17:44 UTC
  To: lartc

Hello, Jay, 

thanks for your prompt answer.

> 	You may also be able to tweak some interface paramaters and
> improve things; I'll point you at this discussion from a few years ago:
>
> http://lists.openwall.net/netdev/2011/08/25/88
OK.
I tried to tweak rx-usecs as described there, but saw no reproducible 
difference. My system's default was 18, and I tried both 6 and 45.

Regarding the TSO et al. issue, I think this has already become the default 
setting on recent systems:

root@blade-001:~# ethtool -k eth0 | grep offload
tcp-segmentation-offload: on
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: on
l2-fwd-offload: off [fixed]

However, following the hints in this article, I came across the most obvious 
way to tweak throughput:
	 jumbo frames.

Setting mtu = 9000 on both sides, I get 
5200 MBit for netperf throughput, which is 86 % of the theoretical maximum.
(was 4100 with mtu=1500 before)

nfs transfer is at 3.4 GBit/s (was 2.7 GBit with mtu=1500)
I had one encounter with 4.2 GBit, but cannot reproduce this

nfs options for the cross-mounted /run/shm ramdisks are shown by mount as 

192.168.130.2:/shm on /cluster/shm/node002 type nfs4 
(rw,noatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,soft,proto=tcp,port=0,timeo=600,retrans=1,sec=sys,clientaddr=192.168.130.3,local_lock=none,addr=192.168.130.2)

What I have configured in the automounter script:

$nfs_opts 
= "-fstype=nfs4,sec=sys,async,noatime,fg,soft,intr,retrans=1,retry=0" ;

So - I haven't configured the rsize/wsize.
As RTFM says, client and server agree on the highest possible values, and try 
to get to 1 MByte.

Anyway, I am getting off topic, as this is not an NFS mailing list.


> % netstat -s|grep -i reord
>     Detected reordering 20 times using time stamp
>
> 	or you can hunt for the raw values in /proc/net/netstat or use
> nstat to print them:

Hm. I see figures, but how do I interpret them?

before:

root@blade-002:~# netstat -s|grep -i reord
    Detected reordering 1 times using FACK

root@blade-003:~# netstat -s|grep -i reord
    Detected reordering 1 times using FACK
    Detected reordering 1 times using SACK


now doing some work:
	copying a 4 GB file over nfs between ram disks:
	(from blade-003 to blade-002)
root@blade-002:~# time cp /cluster/shm/node003/random.002 /run/shm/random.002
	real    0m8.701s
	user    0m0.000s
	sys     0m4.816s


after:

root@blade-002:~# netstat -s|grep -i reord
    Detected reordering 1 times using FACK

root@blade-003:~# netstat -s|grep -i reord
    Detected reordering 2 times using FACK
    Detected reordering 234 times using SACK
    Detected reordering 7 times using time stamp


Wouldn't I expect the reordering problems on the receiver's side?
But I see them on the sender - I double and triple checked this....
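
The change caused by the copy is easier to read as a difference between the
two netstat snapshots of blade-003. A small Python sketch (the function names
are made up for illustration; the input is the output pasted above):

```python
import re

# Matches lines such as "Detected reordering 234 times using SACK".
LINE = re.compile(r"Detected reordering (\d+) times using (.+)")


def parse_reordering(output: str) -> dict:
    """Map detection method -> count from 'netstat -s | grep -i reord' output."""
    return {m.group(2).strip(): int(m.group(1)) for m in LINE.finditer(output)}


def delta(before: dict, after: dict) -> dict:
    """Per-method increase between two snapshots; missing methods count as 0."""
    methods = sorted(set(before) | set(after))
    return {m: after.get(m, 0) - before.get(m, 0) for m in methods}


BEFORE_003 = """\
    Detected reordering 1 times using FACK
    Detected reordering 1 times using SACK
"""

AFTER_003 = """\
    Detected reordering 2 times using FACK
    Detected reordering 234 times using SACK
    Detected reordering 7 times using time stamp
"""

print(delta(parse_reordering(BEFORE_003), parse_reordering(AFTER_003)))
# prints {'FACK': 1, 'SACK': 233, 'time stamp': 7}
```

These differences agree with the TcpExtTCPFACKReorder, TcpExtTCPSACKReorder
and TcpExtTCPTSReorder values in the sender's nstat listing that follows.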


Just in case you have an eye for peculiarities I do not see:

sender side

root@blade-003:~# nstat
#kernel
IpInReceives                    721022             0.0
IpInDelivers                    721022             0.0
IpOutRequests                   550631             0.0
TcpActiveOpens                  1                  0.0
TcpPassiveOpens                 1                  0.0
TcpInSegs                       720990             0.0
TcpOutSegs                      2177539            0.0
TcpRetransSegs                  4566               0.0
UdpInDatagrams                  32                 0.0
UdpOutDatagrams                 2                  0.0
TcpExtDelayedACKs               33                 0.0
TcpExtTCPPrequeued              1                  0.0
TcpExtTCPHPHits                 2066               0.0
TcpExtTCPPureAcks               623665             0.0
TcpExtTCPHPAcks                 38636              0.0
TcpExtTCPSackRecovery           423                0.0
TcpExtTCPFACKReorder            1                  0.0
TcpExtTCPSACKReorder            233                0.0
TcpExtTCPTSReorder              7                  0.0
TcpExtTCPFullUndo               19                 0.0
TcpExtTCPPartialUndo            18                 0.0
TcpExtTCPDSACKUndo              336                0.0
TcpExtTCPFastRetrans            1642               0.0
TcpExtTCPForwardRetrans         2924               0.0
TcpExtTCPDSACKRecv              3942               0.0
TcpExtTCPDSACKOfoRecv           6                  0.0
TcpExtTCPDSACKIgnoredOld        15                 0.0
TcpExtTCPDSACKIgnoredNoUndo     177                0.0
TcpExtTCPSackShifted            62709              0.0
TcpExtTCPSackMerged             261712             0.0
TcpExtTCPSackShiftFallback      404717             0.0
TcpExtTCPRetransFail            43                 0.0
TcpExtTCPRcvCoalesce            536                0.0
TcpExtTCPOFOQueue               3                  0.0
TcpExtTCPSpuriousRtxHostQueues  605                0.0
TcpExtTCPAutoCorking            58763              0.0
TcpExtTCPOrigDataSent           2176191            0.0
IpExtInBcastPkts                30                 0.0
IpExtInOctets                   53708303           0.0
IpExtOutOctets                  3181655278         0.0
IpExtInBcastOctets              2280               0.0
IpExtInNoECTPkts                721719             0.0

receiver side:

root@blade-002:~# nstat
#kernel
IpInReceives                    750213             0.0
IpInAddrErrors                  2                  0.0
IpInDelivers                    750211             0.0
IpOutRequests                   751510             0.0
IcmpInErrors                    246                0.0
IcmpInCsumErrors                112                0.0
IcmpInTimeExcds                 224                0.0
IcmpInEchoReps                  3                  0.0
IcmpInTimestamps                19                 0.0
IcmpOutErrors                   246                0.0
IcmpOutTimeExcds                224                0.0
IcmpOutEchoReps                 19                 0.0
IcmpOutTimestamps               3                  0.0
IcmpMsgInType0                  19                 0.0
IcmpMsgInType3                  224                0.0
IcmpMsgInType8                  3                  0.0
IcmpMsgOutType0                 3                  0.0
IcmpMsgOutType3                 224                0.0
IcmpMsgOutType8                 19                 0.0
TcpActiveOpens                  118                0.0
TcpPassiveOpens                 10                 0.0
TcpAttemptFails                 112                0.0
TcpInSegs                       748966             0.0
TcpOutSegs                      751036             0.0
TcpRetransSegs                  129                0.0
TcpOutRsts                      2                  0.0
UdpInDatagrams                  871                0.0
UdpOutDatagrams                 289                0.0
Ip6OutRequests                  10                 0.0
Ip6OutMcastPkts                 16                 0.0
Ip6OutOctets                    688                0.0
Ip6OutMcastOctets               1144               0.0
Icmp6OutMsgs                    10                 0.0
Icmp6OutRouterSolicits          3                  0.0
Icmp6OutNeighborSolicits        1                  0.0
Icmp6OutMLDv2Reports            6                  0.0
Icmp6OutType133                 3                  0.0
Icmp6OutType135                 1                  0.0
Icmp6OutType143                 6                  0.0
TcpExtPruneCalled               3                  0.0
TcpExtTW                        3                  0.0
TcpExtDelayedACKs               372                0.0
TcpExtDelayedACKLocked          2                  0.0
TcpExtDelayedACKLost            3926               0.0
TcpExtTCPPrequeued              2                  0.0
TcpExtTCPHPHits                 42851              0.0
TcpExtTCPPureAcks               1056               0.0
TcpExtTCPHPAcks                 10889              0.0
TcpExtTCPSackRecovery           4                  0.0
TcpExtTCPFACKReorder            1                  0.0
TcpExtTCPDSACKUndo              2                  0.0
TcpExtTCPFastRetrans            11                 0.0
TcpExtTCPForwardRetrans         3                  0.0
TcpExtTCPTimeouts               113                0.0
TcpExtTCPLossProbes             2                  0.0
TcpExtTCPLossProbeRecovery      1                  0.0
TcpExtTCPRcvCollapsed           543                0.0
TcpExtTCPDSACKOldSent           3951               0.0
TcpExtTCPDSACKOfoSent           6                  0.0
TcpExtTCPDSACKRecv              13                 0.0
TcpExtTCPDSACKIgnoredNoUndo     1                  0.0
TcpExtTCPSackShifted            7                  0.0
TcpExtTCPSackMerged             23                 0.0
TcpExtTCPSackShiftFallback      72                 0.0
TcpExtTCPBacklogDrop            204                0.0
TcpExtTCPRcvCoalesce            44267              0.0
TcpExtTCPOFOQueue               487750             0.0
TcpExtTCPOFOMerge               6                  0.0
TcpExtTCPSpuriousRtxHostQueues  112                0.0
TcpExtTCPAutoCorking            1424               0.0
TcpExtTCPWantZeroWindowAdv      45                 0.0
TcpExtTCPSynRetrans             112                0.0
TcpExtTCPOrigDataSent           16466              0.0
IpExtInBcastPkts                710                0.0
IpExtInOctets                   3252256888         0.0
IpExtOutOctets                  57783538           0.0
IpExtInBcastOctets              75470              0.0
IpExtInNoECTPkts                2236348            0.0


Anyway, I could live with these figures, which I get between interfaces 
configured with balance-rr bonding.

However, when I switch over to the gateway, which is connected by an 802.3ad 
bonding link, performance sucks:

from rr to 802.3ad
root@cruncher:/cluster/etc/scripts/available# time cp /cluster/shm/node003/random.002 /run/shm/
	real    1m20.708s
	user    0m0.000s
	sys     0m5.812s
=> 37 MByte/s = ~ 300 MBit/s

from 802.3ad to rr
root@blade-002:~# time cp /cluster/shm/node000/random.002 /run/shm/random.002
	real    0m26.747s
	user    0m0.008s
	sys     0m4.256s
=> 111 MByte/s = 888 MBit/s
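
The unit conversion for the two transfers is a plain factor of eight bits per
byte, which puts both rates in the range of hundreds of MBit/s, that is, at or
below the speed of a single Gbit link. A tiny Python sketch (the function
name is made up for illustration):

```python
def mbyte_to_mbit(mbyte_per_s: float) -> float:
    """Convert a rate in MByte/s into MBit/s (8 bits per byte)."""
    return mbyte_per_s * 8


# The two rates quoted above, in MByte/s.
RATES = {"rr -> 802.3ad": 37, "802.3ad -> rr": 111}

for direction, rate in RATES.items():
    mbit = mbyte_to_mbit(rate)
    print(f"{direction:15s} {rate:4d} MByte/s = {mbit:.0f} MBit/s "
          f"({mbit / 1000:.2f} of one Gbit link)")
```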




> >I tried layer 2 bonding as described here
(... searching for a )
>>  all-linux, maybe layer 3 alternative,

So maybe I'd leave the rr in place for peer-to-peer connections between the 
blades and just have a layer-3, teql-like thing to the gateway?

Hm, but can this work?
balance-rr bonding sets the same MAC on all bond slaves. So I'm afraid there 
is no longer a chance to mix it with assigning individual IPs to the slave 
interfaces, right?
But when all interfaces have the same MAC, distribution is left to the switch, 
with all the opacity problems I encountered.

So either I go all-layer-2 or all-layer-3, right?

> 	That text in the bonding documentation is fairly old, and
(...)
> 	It doesn't work well today, if for no other reason than
> interrupt coalescing and NAPI on the receiver will induce serious out of
> order delivery, and turning that off is not really an option.

Well, as my figures above tell me, it's not that bad, as long as it can be 
configured undisturbed on both sides and matches the switch topology.


> >- How does the routing look like if I have 17 hosts connected by 6
> > interfaces each?

As long as this question is not worked out, I have no chance to test teql on 
my system, I'm afraid.


> >Current best setting is now having the blades on balancing-rr and the
> > gateway connected by 8 parallel Gbit-links to one single VC-device and
> > using LACP / 802.3ad on this.
>
> 	If you're testing your single stream throughput through this
> LACP aggregation, you'll be limited by the throughput of one member of
> that aggregation, as LACP will not stripe traffic.

I know.
That's the reason why I would like to do round robin.
What can I expect from teql as compared to rr-bonding and to LACP-bonding?


> 	Another issue is that, even if you round-robin from the host's
> bond, if traffic has to transit through a switch aggregation (channel
> group), it will rebalance the traffic on egress, and most likely funnel
> it all back through a single switch port.

That's obviously what happens in 
	blade <-> gateway connections
due to the asymmetric connection.

In blade <-> blade peering, it works fine, as I wrote.


I'll try to draw an ASCII image of the topology:

+-+-+-+-+-+----blade-001
+-+-+-+-+-+----blade-002
+-+-+-+-+-+----blade-003
+-+-+-+-+-+----blade-004
+-+-+-+-+-+----blade-005
+-+-+-+-+-+----blade-006
+-+-+-+-+-+----blade-007
+-+-+-+-+-+----blade-008
+-+-+-+-+-+----blade-009
+-+-+-+-+-+----blade-010
+-+-+-+-+-+----blade-011
+-+-+-+-+-+----blade-012
+-+-+-+-+-+----blade-013
+-+-+-+-+-+----blade-014
+-+-+-+-+-+----blade-015
+-+-+-+-+-+----blade-016
+-------------eth2---gateway(aka cruncher)
+-------------eth3---gateway(aka cruncher)
+-------------eth4---gateway(aka cruncher)
+-------------eth5---gateway(aka cruncher)
+-------------eth6---gateway(aka cruncher)
+-------------eth7---gateway(aka cruncher)
+-------------eth8---gateway(aka cruncher)
+-------------eth9---gateway(aka cruncher)


Each + column is a VC switching module.
Blades have eth0 ... eth5 connected in the hardwired matrix shown.

There are additional stacking links between the VC switching modules not shown 
here.
But it looks like the shortest-path algorithm keeps rr neatly ordered between 
blades.


But when I distribute the gateway connections equally across all switch modules, 
only one of them is "link-active"; the others are shown as "link-failover".
Only when I connect all of them to a single VC and configure them using LACP 
are they used in parallel. But that does not match the round-robin mode, just as 
you mentioned.



> one-switch-per-interface sort of arrangement that blade environments
> impose, and never really got bonding to work well for load balancing in
> those type of environments.
>
> 	One issue for production use was that if a switch port fails on
> one of the switches, the other peers sending traffic into that switch

Well, I think there are different goals in High-PERFORMANCE clustering as 
opposed to High-AVAILABILITY clustering.

Most "production use" refers to web server or enterprise system stuff, 
which is basically HA, I'd say. And that's what those boxes are optimised 
for - see the link-failover issue above.

If I were setting up a new HPC cluster with a bundle of dollar notes, I would 
presumably simply go for InfiniBand instead of Ethernet (or at least 10 Gbit 
Ethernet), but there's no budget for that. I simply try to get the best out of 
the stuff that I can pick up at the lower end of the food chain ;-)


Hm, so what now?
I'll read the HP documentation to see whether I can get rid of the failover 
behaviour. If only I could just rip off all the stacking links and let the VC 
modules each behave as a "good old cheap and silly" switch....



Wolfgang Rosner


* Re: TEQL for bonding Multi Gbit Ethernet in a cluster - WORKS with quirks
  2015-03-13 21:26 TEQL for bonding Multi Gbit Ethernet in a cluster? Wolfgang Rosner
  2015-03-14  0:39 ` Jay Vosburgh
  2015-03-14 17:44 ` Wolfgang Rosner
@ 2015-03-16  9:10 ` Wolfgang Rosner
  2 siblings, 0 replies; 4+ messages in thread
From: Wolfgang Rosner @ 2015-03-16  9:10 UTC
  To: lartc

Hello,

The good news in short: IT WORKS

I get 5.58 GBit/s over 6 x 1 GBit between my blade nodes,
using layer 3 teql link aggregation:

root@blade-002:~# iperf -c 192.168.130.225
........
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  6.49 GBytes  5.58 Gbits/sec

The /27 net approach worked fine and was straightforward.
It's a simple extension of the /31 approach described here
http://lartc.org/howto/lartc.loadshare.html

It needs just the routes that come up automatically when the IP addresses are 
configured.
I divided a /24 net into 8 chunks:
- one for the boot configuration (PXE, nfsroot...)
- six for the parallel link subnets, one per link
- one for the teql subnet

root@blade-001:~# ip addr
....
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc teql0 state UP qlen 1000
    link/ether 00:22:64:06:9b:7a brd ff:ff:ff:ff:ff:ff
    inet 192.168.130.1/27 brd 192.168.130.31 scope global eth0
    inet 192.168.130.33/27 scope global eth0:0
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc teql0 state UP qlen 1000
    link/ether 00:22:64:06:db:4c brd ff:ff:ff:ff:ff:ff
    inet 192.168.130.65/27 scope global eth1
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc teql0 state UP qlen 1000
    link/ether 00:21:5a:af:8e:40 brd ff:ff:ff:ff:ff:ff
    inet 192.168.130.97/27 scope global eth2
....
7: eth5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc teql0 state UP qlen 1000
    link/ether 00:21:5a:af:8e:43 brd ff:ff:ff:ff:ff:ff
    inet 192.168.130.193/27 scope global eth5
8: teql0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 9000 qdisc pfifo_fast state UNKNOWN qlen 100
    link/void
    inet 192.168.130.225/27 scope global teql0
       valid_lft forever preferred_lft forever

(boring lines deleted)
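
The "qdisc teql0" entries in the listing come from attaching each physical 
interface to the teql0 equalizer, as in the LARTC howto linked above. The 
following is a sketch that only prints the commands (a dry run) instead of 
running them. The helper name teql_cmds is made up, and the device list 
eth0 ... eth5 assumes that the interfaces elided from the listing follow the 
same naming.

```shell
#!/bin/sh
# Print the commands that attach each given device to teql0 and bring
# the aggregate device up. Nothing is executed; teql_cmds is a
# hypothetical helper name.
teql_cmds() {
    for dev in "$@"; do
        echo "tc qdisc add dev $dev root teql0"
    done
    echo "ip link set dev teql0 up"
}

teql_cmds eth0 eth1 eth2 eth3 eth4 eth5
```

After reviewing the printed lines, they could be piped into "sh" as root to 
apply them.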

Jumbo frames (MTU = 9000) are essential: they increase throughput from ~3 GBit 
(about 50 % of the theoretical maximum) to > 5.5 GBit (> 90 %).
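
A sketch of how the MTU could be raised on all slave interfaces and on teql0. 
It only prints the "ip link" commands rather than running them. The helper 
name mtu_cmds is made up, the device names are assumed from the listing 
above, and it is assumed that every switch in the path passes jumbo frames.

```shell
#!/bin/sh
# Print one "ip link set ... mtu" command per device (dry run).
# mtu_cmds is a hypothetical helper; first argument is the MTU,
# the remaining arguments are device names.
mtu_cmds() {
    mtu=$1
    shift
    for dev in "$@"; do
        echo "ip link set dev $dev mtu $mtu"
    done
}

mtu_cmds 9000 eth0 eth1 eth2 eth3 eth4 eth5 teql0
```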

So far so good:
I can combine the performance of layer 2 aggregation (bonding) with layer 3 
control of what's going on, getting a grip on nasty switch behaviour.
At least, so I hoped.


== QUIRKS ==

But when it comes to transfers between the blade nodes and the external 
gateway, things get funny again.

This is what the network looks like now:
The gateway aka cruncher is connected by one Gbit cable to each of the six 
VC switches in the blade enclosure. For each VC bay (matching the 
physical /27 subnets) I configured a separate VLAN to convince VC to treat 
the uplinks as parallel, not as failover.


+-------------eth4---gateway(aka cruncher)
| +-------------eth5---gateway(aka cruncher)
| | +-------------eth6---gateway(aka cruncher)
| | | +-------------eth7---gateway(aka cruncher)
| | | | +-------------eth8---gateway(aka cruncher)
| | | | | +-------------eth9---gateway(aka cruncher)
+-+-+-+-+-+----blade-001
+-+-+-+-+-+----blade-002
+-+-+-+-+-+----blade-003

A straight implementation of the above scheme on the gateway yields no more 
than ~2 GBit. 
So some aggregation happens, but far from the 6 GBit maximum.

ifconfig and wireshark show traffic coming in equally over all 6 lines,
but with an awful lot of retransmits.
Well, maybe wireshark gets confused by teql and fails to match packets 
since they go over different interfaces, but that's another issue, not the 
primary one here.

After lots of googling, I pinned the symptom down to this issue:

# for i in `seq 2 9`; do ethtool -S eth$i | grep rx_missed_errors ; done
     rx_missed_errors: 0
     rx_missed_errors: 0
     rx_missed_errors: 0
     rx_missed_errors: 0
     rx_missed_errors: 29159 
     rx_missed_errors: 28619
     rx_missed_errors: 9263
     rx_missed_errors: 23306



from
http://osdir.com/ml/linux.drivers.e1000.devel/2007-11/msg00133.html

---<quote>--------------------
you are running out of bus bandwidth (which is why increasing
descriptors doesn't help). rx_missed_errors occur when you run out of
fifo on the adapter itself, indicating the bus can't be attained for
long enough to keep the data rate up.
---</quote>--------------------
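
The loop above prints the counters without saying which interface each one 
belongs to. A small sketch that labels them: rx_missed is a made-up helper 
that pulls the counter out of "ethtool -S" output on stdin, and the 
commented-out loop assumes eth2 ... eth9 exist as on the gateway here.

```shell
#!/bin/sh
# Extract the rx_missed_errors value from `ethtool -S` output on stdin.
# rx_missed is a hypothetical helper name.
rx_missed() {
    awk -F': *' '$1 ~ /rx_missed_errors$/ { gsub(/ /, "", $2); print $2 }'
}

# On a live system (needs ethtool and the real interfaces):
# for i in $(seq 2 9); do
#     printf 'eth%s %s\n' "$i" "$(ethtool -S "eth$i" | rx_missed)"
# done

# Demonstration on two sample lines in the format shown above:
printf '     rx_packets: 100\n     rx_missed_errors: 29159 \n' | rx_missed
```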


eth2 ... eth5 and eth6 ... eth9 are each a quad-port 82571EB Gigabit Ethernet 
adapter.

Extracted from lspci:

0c:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
        Subsystem: Intel Corporation PRO/1000 PT Quad Port Server Adapter

07:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) (rev 06)
        Subsystem: Hewlett-Packard Company NC364T PCI Express Quad Port Gigabit Server Adapter

'   +-0a.0-[05-08]----00.0-[06-08]--+-02.0-[07]--+-00.0
'   |                               |            \-00.1
'   |                               \-04.0-[08]--+-00.0
'   |                                            \-00.1
'   +-0b.0-[09]--+-00.0           |            \-00.1
'   +-0d.0-[0a-0d]----00.0-[0b-0d]--+-00.0-[0c]--+-00.0
'   |                               |            \-00.1
'   |                               \-01.0-[0d]--+-00.0
'   |                                            \-00.1


So both adapters have the same chipset, same driver, similar bus connectivity 
and announce identical PCI bus bandwidth:
	'LnkSta: Speed 2.5GT/s, Width x4'

According to http://en.wikipedia.org/wiki/PCI_Express
this comes out to 8 Gbit/s, which should basically suffice, I think.
And on the "good" NIC, it obviously does:

To check, and to gain some safety headroom, I moved 2 cables from the "buggy" 
NIC to the "healthy" one - and kept the link config matching, of course. 

And, lo and behold, we get up from ~2 GBit to > 3 GBit.
There are still thousands of rx_missed_errors on the "bad" NIC, which now 
only has to serve 2 GBit connections, and still zero rx_missed_errors on the 
"good" NIC, which now carries 4 active GBit connections.
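
The 8 Gbit/s figure for a 2.5 GT/s x4 link can be checked with a little 
arithmetic. The sketch assumes 8b/10b line coding (8 payload bits per 10 
transferred bits, as used by this PCIe generation) and ignores any further 
protocol overhead; pcie_mbit is a made-up helper name.

```shell
#!/bin/sh
# Usable PCIe bandwidth per direction in Mbit/s, assuming 8b/10b coding.
# $1 = transfer rate per lane in MT/s (2500 for 2.5 GT/s), $2 = lanes.
# pcie_mbit is a hypothetical helper name.
pcie_mbit() {
    echo $(( $1 * $2 * 8 / 10 ))
}

pcie_mbit 2500 4    # prints 8000, i.e. 8 Gbit/s
```

Four ports at 1 Gbit each need 4000 Mbit/s per direction, so on paper the 
link has twice the required capacity.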

Further googling and tweaking memory limits in
	/proc/sys/net/ipv4/tcp_*mem
and 
 	/proc/sys/net/core/*mem*
showed no difference.

What helped was to increase the "TCP window size" on the iperf server side 
from 
	"TCP window size: 85.3 KByte (default)"
to a value between 512K and 2M

root@cruncher:/cluster/etc/network# iperf -s -w1M
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[  4] local 192.168.130.254 port 5001 connected with 192.168.130.226 port 33775
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec  5.06 GBytes  4.35 Gbits/sec

Now we are over 70 % of the theoretical maximum.
However, I neither really understand it, nor do I know how to transfer 
this window size setting to other applications.
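
One hint on where the 85.3 KByte default may come from: it matches the middle 
("default") field of net.ipv4.tcp_rmem when that field is 87380 bytes, a 
common value on kernels of this age. That value is an assumption here, since 
the actual setting on these hosts is not shown. If it holds, raising that 
field would change the default for applications that do not set their own 
buffer size. Whether that helps in this case is open, given that the 
tcp_*mem tweaks mentioned above made no difference. rmem_default_kbyte is a 
made-up helper name.

```shell
#!/bin/sh
# Convert the middle ("default") field of a tcp_rmem triple
# "min default max" from bytes to KByte with one decimal.
# rmem_default_kbyte is a hypothetical helper name.
rmem_default_kbyte() {
    echo "$1" | LC_ALL=C awk '{ printf "%.1f\n", $2 / 1024 }'
}

# On a live Linux system:
#   rmem_default_kbyte "$(cat /proc/sys/net/ipv4/tcp_rmem)"

# With the assumed triple:
rmem_default_kbyte "4096 87380 6291456"    # prints 85.3
```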

I think the TCP window size is just a workaround for underlying problems, 
because
- there are still lots of rx_missed_errors on eth6 and eth7
- the blade-to-blade connection at 5.6 GBit works even better, without any 
tweaking and with the small default TCP window size:


root@blade-001:~# iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[  4] local 192.168.130.225 port 5001 connected with 192.168.130.226 port 49581
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec  6.49 GBytes  5.58 Gbits/sec



Possible causes on my list:

- firmware problem (NICs, mainboard)
- hardware problem (NICs, mainboard)
- conceptual limitation of hardware design
- some really weird hidden tweak parameter
- driver problem
- kernel / scheduling issue / IRQ / race... whatever?
- still the nasty VC blade switch?
- any more?


The gateway mainboard is a SABERTOOTH 990FX R2.0
[AMD/ATI] RD890 PCI to PCI bridge (external gfx1 port A)
- consumer grade, but quite recent -
Gateway CPU is an AMD FX-8320 8 Core
Linux cruncher 3.19.0 #1 SMP Tue Mar 3 19:05:04 CET 2015 x86_64 GNU/Linux

The blade nodes are HP blades 460c G1
chipset Intel 5000
- enterprise grade, but quite some years old now, I suppose -
CPU 2 x Xeon E5430 quad
Linux blade-002.crunchnet 3.16.0-0.bpo.4-amd64 #1 SMP Debian 3.16.7-ckt4-3~bpo70+1 (2015-02-12) x86_64 GNU/Linux


In a memory bandwidth test with mbw (as a first measure of system bus 
throughput), the gateway outperforms the blades by a factor of two:

root@blade-002:~# mbw -n1 1000
AVG     Method: MEMCPY  Elapsed: 0.61679        MiB: 1000.00000 Copy: 1621.300 MiB/s
AVG     Method: DUMB    Elapsed: 0.51892        MiB: 1000.00000 Copy: 1927.068 MiB/s
AVG     Method: MCBLOCK Elapsed: 0.39211        MiB: 1000.00000 Copy: 2550.311 MiB/s

root@cruncher...#  mbw -n1 1000
AVG     Method: MEMCPY  Elapsed: 0.27301        MiB: 1000.00000 Copy: 3662.923 MiB/s
AVG     Method: DUMB    Elapsed: 0.19693        MiB: 1000.00000 Copy: 5077.972 MiB/s
AVG     Method: MCBLOCK Elapsed: 0.19287        MiB: 1000.00000 Copy: 5184.947 MiB/s

So, conceptually, I see no reason why, of two nearly identical quad-Gbit 
adapters, one should fail so badly on the faster system.

I compared the lspci output line by line again and found a tiny difference:

Hewlett-Packard Company NC364T.... (the 'bad')
        Region 0: Memory at fc400000 (32-bit, non-prefetchable) [size=128K]
        Region 1: Memory at fc300000 (32-bit, non-prefetchable) [size=512K]
        Region 2: I/O ports at 8000 [size=32]

Intel Corporation PRO/1000 PT ...('the good')
        Region 0: Memory at fc5a0000 (32-bit, non-prefetchable) [size=128K]
        Region 1: Memory at fc580000 (32-bit, non-prefetchable) [size=128K]
        Region 2: I/O ports at 5020 [size=32]

So the "Region 1" memory is 4x larger on the 'bad' NIC.
Any clue whether this may be related? 
Just an uneducated guess:
If it were some kind of pointer FIFO into some buffer memory, the larger one 
might run out of referenced buffer while the smaller one does not?

How to proceed from "Guess" to "Know" to "Cure"?

Anybody any idea?



===========
Just to exclude the idiot's error before hitting the send button:
I swapped the cables on the faulty NIC (only two were left by now)
and the rate on the teql link went down from > 2 Gbit to ~ 340 Kbits/sec.

So, yes, the cabling was right before,
and yes, the scheme provides some fault tolerance, albeit with a severe hit in 
performance.


Wolfgang Rosner

