From: Wolfgang Rosner <wrosner@tirnet.de>
To: lartc@vger.kernel.org
Subject: TEQL for bonding Multi Gbit Ethernet in a cluster?
Date: Fri, 13 Mar 2015 21:26:38 +0000
Message-ID: <201503132226.39582.wrosner@tirnet.de>
Hello,
Can I use TEQL to aggregate multiple Gbit Ethernet links in a multiple-switch
topology across multiple hosts?
In my case, that means 17 hosts with 6 Gbit Ethernet ports each.
Has anybody tried, and maybe even documented, such an approach?
I tried layer 2 bonding as described here
http://www.linuxfoundation.org/collaborate/workgroups/networking/bonding#Maximum_Throughput_in_a_Multiple_Switch_Topology
but I am struggling with disappointing performance gains, a misbehaving
switch layer and problems during PXE/DHCP boot.
Googling for a more controllable, all-linux, maybe layer 3 alternative, I
encountered LARTC.
I think multirouting as in chapter 4.2.2 does not solve my problem, as I want
to share bandwidth for single large transfers, too.
I'd like to try the TEQL approach of chapter 10, but there are some open
questions:
- What does the routing look like if I have 17 hosts connected by 6 interfaces
each?
I think I cannot use the /31 net approach on a 1-to-1 basis, since I have 17
machines on each subnet.
Can I use /27 nets instead, allowing 30 hosts per subnet?
Or do I need a /31 subnet for each pair of machines on each switch device,
which would be a total of (17 x 16 / 2) x 6 = 816 /31 subnets?
Is this idea correct:
- one IP address for teql0 and 6 x 1 IP for eth0 ... eth5 on each host,
which equals 7 x 17 = 119 IP addresses in total
- a route for each target on any physical interface on any host, pointing to
the counterpart on the same subnet like
route add -host <teql-IP-on-target> gw <matching-dev-IP-on-target>
This still adds up to 16 peers x 6 Interfaces = 96 routes on each host.
How does this affect performance?
Of course I can script this, but is there a more "elegant" way?
Like calculated / OR-ed filter addresses?
- Can I continue to use the physical links directly, particularly for
PXE booting?
- Can I keep the switch configuration as one large network and let ARP / layer
3 sort out the details, or is it necessary/advantageous to configure all
layer 3 subnets as separate layer 2 VLANs as well?
Or do I even need 816 VLANs for 816 /31 subnets on a peer-to-peer basis?
- the clients run diskless on nfsroot, which is established by the dracut boot
process.
So either I have to establish the whole TEQL setup within dracut during boot, or I
have to reconfigure the network after boot, without dropping the running
nfsroot. Is this possible?
- I only find reports and advice for 2.x kernels in the list archives.
Are there any advances on the TCP tuning issues in recent kernels?
- Can I expect a performance gain at all, or will the additional CPU overhead
outweigh the gain in bandwidth?
- what are the recommended tools for testing and tuning?
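
The address and route bookkeeping asked about in the first question can be sketched in a few lines of Python. This is only a sketch of the scheme proposed above, not a tested TEQL configuration: the address plan (10.0.<link>.0/27 per switch, 10.0.100.0/27 for the teql0 devices) and the helper names are hypothetical. Whether the kernel accepts several host routes to the same target, and whether TEQL then uses them as intended, remains the open question; the script only enumerates the commands and checks the counts.

```python
import ipaddress
from itertools import combinations

N_HOSTS = 17   # 16 blades plus the gateway
N_LINKS = 6    # eth0 ... eth5, one port per switch module

# Hypothetical address plan, chosen only for illustration:
# one /27 per switch module and one /27 for the teql0 devices.
LINK_NETS = [ipaddress.ip_network(f"10.0.{link}.0/27") for link in range(N_LINKS)]
TEQL_NET = ipaddress.ip_network("10.0.100.0/27")


def link_ip(host, link):
    """Address of eth<link> on host number `host` (1-based)."""
    return LINK_NETS[link][host]


def teql_ip(host):
    """Address of teql0 on host number `host` (1-based)."""
    return TEQL_NET[host]


def routes_for(host):
    """Host routes in the form proposed in this mail: one per peer and link."""
    cmds = []
    for peer in range(1, N_HOSTS + 1):
        if peer == host:
            continue
        for link in range(N_LINKS):
            cmds.append(
                f"route add -host {teql_ip(peer)} gw {link_ip(peer, link)}"
            )
    return cmds


# A /27 holds 32 addresses; network and broadcast address are not usable.
usable_per_27 = TEQL_NET.num_addresses - 2
# One teql0 address plus one address per physical port, on every host.
total_ips = N_HOSTS * (N_LINKS + 1)
# Alternative plan: one /31 per pair of hosts on each switch module.
pairwise_31_subnets = len(list(combinations(range(N_HOSTS), 2))) * N_LINKS

if __name__ == "__main__":
    print(usable_per_27, total_ips, pairwise_31_subnets, len(routes_for(1)))
    print("\n".join(routes_for(1)[:N_LINKS]))
```

The counts match the figures above: 30 usable hosts per /27, 119 addresses, 816 pairwise /31 subnets and 96 routes per host. Generating the routes from the host and link numbers keeps the per-host configuration down to a single parameter, the host number.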
========================
what I have done so far:
I'm just going to build a "poor man's beowulf cluster" from a bunch of used
server parts, sourced on ebay.
So I end up with a HP blade center with 16 blade servers in it, each equipped
with 6 x 1 GBit Ethernet ports.
They are linked by HP Virtual connect ("VC") switch units, in the way that
there are 6 VC, each with one port to every one of the blade servers.
This mapping is hardwired by the blade center design.
The VC is administered and advertised like one large manageable switch, but
with caveats, see below.
The whole thing is connected to the outside world via a consumer grade PC
acting as a gateway and file server with 2x4=8 Gbit ethernet for the cluster
side.
All boxes run on debian wheezy, with 3.19.0 vanilla on the gateway and
debian 3.16.7-ckt4-3~bpo70+1 at the blades.
Blades are booted over DHCP/PXE/TFTP/nfsroot.
Of course I would like to utilize the full available network bandwidth for
interprocess communication.
My first try was linux bonding with 802.3ad bonding policy.
see
http://www.linuxfoundation.org/collaborate/workgroups/networking/bonding#Bonding_Driver_Options
However, all traffic goes over one interface only.
Maximum throughput is ~ 900 MBit/s.
Googling the issue, I learned that VC does not support LACP bonding across
different VC modules, so they are only "little-bit-stackable-switches".
Next try was bonding with balance-rr as given here:
http://www.linuxfoundation.org/collaborate/workgroups/networking/bonding#Maximum_Throughput_in_a_Multiple_Switch_Topology
To get the whole symmetry described there,
I connected the external gateway with 6 ethernet ports to each of the
VC-modules on a 1:1 basis. However, this breaks PXE booting, since the PXE
machine does not appear to support bonding, so even the first DHCP breaks.
The current best setting is having the blades on balance-rr and the gateway
connected by 8 parallel Gbit links to one single VC device, using LACP /
802.3ad on this.
However, performance is still far below expectation:
~ 2.5 GBit/s between two blades, using nfs copy of 3 GBit files located in
ramdisk
~ 0.9 GBit/s between server and blade via nfs copy
~ 2.8 GBit/s running dbench -D /home 50 in parallel on 16 clients
I partially understand the last 2 figures as limitations of the 802.3ad LACP
protocol.
I can see unequal load distribution in ifconfig stats.
I can watch periodical ups and downs during the 5 min dbench run, so I suspect
some kind of a TCP congestion issue.
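
The per-conversation limit of 802.3ad can be illustrated with a simplified model of the layer2 transmit hash, roughly as the bonding documentation describes it (XOR of the MAC addresses, modulo the number of slaves). The MAC addresses below are hypothetical, and the real kernel hash as well as the hash used on the switch side differ in detail, so this is a sketch of the principle only.

```python
def layer2_slave(src_mac, dst_mac, n_slaves):
    """Simplified layer2 hash: XOR of the last MAC bytes, modulo slave count."""
    src_last = int(src_mac.split(":")[-1], 16)
    dst_last = int(dst_mac.split(":")[-1], 16)
    return (src_last ^ dst_last) % n_slaves


# Hypothetical MAC addresses for the gateway and the 16 blades.
GATEWAY = "02:00:00:00:00:64"
BLADES = [f"02:00:00:00:00:{i:02x}" for i in range(1, 17)]


def slave_usage(n_slaves=8):
    """Number of blades whose traffic from the gateway lands on each slave."""
    usage = [0] * n_slaves
    for mac in BLADES:
        usage[layer2_slave(GATEWAY, mac, n_slaves)] += 1
    return usage


if __name__ == "__main__":
    # Every frame between one pair of hosts maps to the same slave, so a
    # single server-to-blade copy cannot exceed the rate of one link.
    print("gateway -> blade 1 uses slave", layer2_slave(GATEWAY, BLADES[0], 8))
    print("blades per slave:", slave_usage())
```

In this model the hash depends only on the two addresses, so a single pair of hosts never spreads over more than one link, and how evenly several peers spread over the links depends entirely on their addresses.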
I still do not understand the limitations of the direct blade-to-blade
transfer using the round-robin-policy. According to ifconfig, both incoming
and outgoing traffic is equally distributed over all physical links.
I'm afraid this has something to do with TCP reordering / slow-down /
congestion window.
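
The suspected reordering effect can be made concrete with a small simulation of round-robin striping. It is a toy model under stated assumptions: the link delays and the send interval are made-up numbers, and queueing in the switches and NICs is ignored. It only shows that one slightly slower link is enough to deliver packets out of order, which TCP may mistake for loss.

```python
def arrival_order(n_packets, link_delays_us, send_interval_us):
    """Stripe packets round-robin over the links; return sequence numbers
    in the order in which they arrive at the receiver."""
    arrivals = []
    for seq in range(n_packets):
        link = seq % len(link_delays_us)
        t_arrive = seq * send_interval_us + link_delays_us[link]
        arrivals.append((t_arrive, seq))
    arrivals.sort()
    return [seq for _, seq in arrivals]


def count_out_of_order(order):
    """Count packets that arrive after a packet with a higher sequence number."""
    highest = -1
    late = 0
    for seq in order:
        if seq < highest:
            late += 1
        else:
            highest = seq
    return late


if __name__ == "__main__":
    # Hypothetical numbers: 6 links, one packet sent every 12 microseconds.
    equal = [50] * 6                    # all links equally fast
    skewed = [50, 50, 50, 50, 50, 90]   # one link 40 microseconds slower
    for name, delays in (("equal", equal), ("skewed", skewed)):
        late = count_out_of_order(arrival_order(60, delays, 12))
        print(f"{name}: {late} of 60 packets arrived out of order")
```

With equal delays nothing is reordered in this model; with one slower link, most packets sent over that link are overtaken by their successors. The bonding documentation mentions the net.ipv4.tcp_reordering sysctl in this context for balance-rr.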
Thank you for any pointer.
Wolfgang Rosner