From: Wolfgang Rosner <wrosner@tirnet.de>
To: lartc@vger.kernel.org
Subject: Re: TEQL for bonding Multi Gbit Ethernet in a cluster - WORKS with quirks
Date: Mon, 16 Mar 2015 09:10:25 +0000
Message-ID: <201503161010.26403.wrosner@tirnet.de>
In-Reply-To: <201503132226.39582.wrosner@tirnet.de>
Hello,
the good news in short: IT WORKS
I get 5.58 GBit / sec over 6 x 1 GBit between my blade nodes,
using layer 3 teql link aggregation:
root@blade-002:~# iperf -c 192.168.130.225
........
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 6.49 GBytes 5.58 Gbits/sec
The /27 net approach worked fine and was straightforward.
It's a simple extension of the /31 approach described here
http://lartc.org/howto/lartc.loadshare.html
No extra routes are needed, just the default routes that come up when
configuring the IP addresses.
I divided a /24 net into 8 chunks of /27 each:
- one for the boot configuration (PXE, nfsroot...)
- six for the parallel link subnets, one per link
- one for the teql subnet
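The split above can be sketched as a small shell function. This is only an illustration: the function name list_subnets is made up here, and the base prefix 192.168.130 is taken from the addresses shown in this mail.

```shell
#!/bin/sh
# List the eight /27 subnets of a /24 with network, first host and broadcast.
# list_subnets is a hypothetical helper, not part of any tool.
list_subnets() {
    base=$1                      # first three octets, e.g. 192.168.130
    i=0
    while [ "$i" -lt 8 ]; do
        net=$((i * 32))          # each /27 spans 32 addresses
        echo "$base.$net/27 first=$base.$((net + 1)) bcast=$base.$((net + 31))"
        i=$((i + 1))
    done
}

list_subnets 192.168.130
```

In the address listing of this mail, the first chunk (192.168.130.1) is the boot net, the chunks starting at .32 through .192 carry the six links, and the last chunk (.224) is the teql net.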
root@blade-001:~# ip addr
....
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc teql0 state UP qlen
1000
link/ether 00:22:64:06:9b:7a brd ff:ff:ff:ff:ff:ff
inet 192.168.130.1/27 brd 192.168.130.31 scope global eth0
inet 192.168.130.33/27 scope global eth0:0
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc teql0 state UP qlen
1000
link/ether 00:22:64:06:db:4c brd ff:ff:ff:ff:ff:ff
inet 192.168.130.65/27 scope global eth1
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc teql0 state UP qlen
1000
link/ether 00:21:5a:af:8e:40 brd ff:ff:ff:ff:ff:ff
inet 192.168.130.97/27 scope global eth2
....
7: eth5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc teql0 state UP qlen
1000
link/ether 00:21:5a:af:8e:43 brd ff:ff:ff:ff:ff:ff
inet 192.168.130.193/27 scope global eth5
8: teql0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 9000 qdisc pfifo_fast state
UNKNOWN qlen 100
link/void
inet 192.168.130.225/27 scope global teql0
valid_lft forever preferred_lft forever
(boring lines deleted)
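For readers who want to reproduce a configuration like the one listed above, here is a minimal sketch following the LARTC load sharing howto. The exact commands used on the blades are not shown in this mail, so the ordering and the helper name teql_commands are assumptions. The function only prints the commands; nothing is changed on the system.

```shell
#!/bin/sh
# Print (not run) the commands that attach NICs to teql0.
# teql_commands is a hypothetical helper; review its output before use.
teql_commands() {
    teql_addr=$1; shift          # address for teql0, e.g. 192.168.130.225/27
    echo "modprobe sch_teql"
    for dev in "$@"; do
        # slaves get the jumbo MTU first, then the teql0 root qdisc
        echo "ip link set dev $dev mtu 9000 up"
        echo "tc qdisc add dev $dev root teql0"
    done
    echo "ip link set dev teql0 mtu 9000 up"
    echo "ip addr add $teql_addr dev teql0"
}

teql_commands 192.168.130.225/27 eth0 eth1 eth2 eth3 eth4 eth5
```

To apply the result, the output could be piped to sh as root. The howto also notes that return path filtering (rp_filter) has to be disabled on the slave interfaces; that step is left out of the sketch.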
Jumbo frames (mtu = 9000) are essential, they increase throughput from ~ 3 GBit
(aka 50 % of theoretical maximum) to > 5.5 GBit (aka > 90 %)
So far so good:
I can combine the performance of layer 2 aggregation (bonding) with layer 3
control of what's going on, getting clamps on nasty switch behaviour.
At least, so I hoped.
== QUIRKS ==
But when it gets to transfer between the blade nodes and the external gateway,
things get funny again.
This is what the network looks like now:
The gateway aka cruncher is connected by a separate Gbit cable to each of the
six VC switches in the blade enclosure. For each VC bay (matching the
physical /27 subnets) I configured a separate vlan to convince VC to treat
the uplinks as parallel, not as failover.
+-------------eth4---gateway(aka cruncher)
| +-------------eth5---gateway(aka cruncher)
| | +-------------eth6---gateway(aka cruncher)
| | | +-------------eth7---gateway(aka cruncher)
| | | | +-------------eth8---gateway(aka cruncher)
| | | | | +-------------eth9---gateway(aka cruncher)
+-+-+-+-+-+----blade-001
+-+-+-+-+-+----blade-002
+-+-+-+-+-+----blade-003
A straight implementation of the above scheme on the gateway yields no more
than ~ 2 GBit.
So, some aggregation happens, but far from the 6 GBit maximum.
ifconfig and wireshark show traffic coming in equally over all 6 lines,
but with an awful lot of retransmits.
Well, maybe wireshark gets confused by teql and fails to match packets
since they go over different interfaces, but that's another issue, not the
primary one here.
After lots of googling, I pinned the symptom down to this issue:
# for i in `seq 2 9`; do ethtool -S eth$i | grep rx_missed_errors ; done
rx_missed_errors: 0
rx_missed_errors: 0
rx_missed_errors: 0
rx_missed_errors: 0
rx_missed_errors: 29159
rx_missed_errors: 28619
rx_missed_errors: 9263
rx_missed_errors: 23306
from
http://osdir.com/ml/linux.drivers.e1000.devel/2007-11/msg00133.html
---<quote>--------------------
you are running out of bus bandwidth (which is why increasing
descriptors doesn't help). rx_missed_errors occur when you run out of
fifo on the adapter itself, indicating the bus can't be attained for
long enough to keep the data rate up.
---</quote>--------------------
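The counter loop shown earlier prints the values without interface names, which makes it easy to lose track of which port is dropping. A small variant that labels each line might look like this. The helper names missed_count and report_missed are made up for illustration; the counter name is the one shown in the output above.

```shell
#!/bin/sh
# Extract the rx_missed_errors value from "ethtool -S" output on stdin.
missed_count() {
    awk -F': *' '/rx_missed_errors/ { print $2 }'
}

# Print one labelled line per interface. Interfaces without the counter
# (or hosts without ethtool) show up as "n/a".
report_missed() {
    for dev in "$@"; do
        n=$(ethtool -S "$dev" 2>/dev/null | missed_count)
        echo "$dev ${n:-n/a}"
    done
}

report_missed eth2 eth3 eth4 eth5 eth6 eth7 eth8 eth9
```

Running it twice a few seconds apart during an iperf run shows which counters are still growing.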
eth2 .. eth5 and eth6 ... eth9 are a quad port 82571EB Gigabit Ethernet each.
extracted from lspci I find
0c:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet
Controller (rev 06)
Subsystem: Intel Corporation PRO/1000 PT Quad Port Server Adapter
07:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet
Controller (Copper) (rev 06)
Subsystem: Hewlett-Packard Company NC364T PCI Express Quad Port
Gigabit Server Adapter
' +-0a.0-[05-08]----00.0-[06-08]--+-02.0-[07]--+-00.0
' |                               |            \-00.1
' |                               \-04.0-[08]--+-00.0
' |                                            \-00.1
' +-0b.0-[09]--+-00.0
' |            \-00.1
' +-0d.0-[0a-0d]----00.0-[0b-0d]--+-00.0-[0c]--+-00.0
' |                               |            \-00.1
' |                               \-01.0-[0d]--+-00.0
' |                                            \-00.1
so both adapters have the same chipset, same driver, similar bus connectivity
and announce identical PCI bus bandwidth:
'LnkSta: Speed 2.5GT/s, Width x4'
According to http://en.wikipedia.org/wiki/PCI_Express
this comes out to 8 Gbit/s (2.5 GT/s x 4 lanes, 8b/10b encoded), which should
basically suffice, I think.
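The 8 Gbit/s figure follows from the link status line: 2.5 GT/s per lane, 4 lanes, and the 8b/10b line code used by this PCIe generation (8 data bits per 10 bits on the wire). As a worked sketch, with pcie_mbit being a made-up helper name:

```shell
#!/bin/sh
# Raw PCIe data rate in Mbit/s per direction, before packet overhead.
# Arguments: transfer rate per lane in MT/s, number of lanes.
# Assumes 8b/10b encoding, which applies to 2.5 GT/s and 5 GT/s links.
pcie_mbit() {
    echo $(( $1 * $2 * 8 / 10 ))
}

pcie_mbit 2500 4    # prints 8000
```

Four Gbit ports at line rate need about 4000 Mbit/s in each direction, so the link itself leaves roughly a factor of two of headroom before PCIe packet overhead is counted.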
And on the "good" NIC, it actually does, obviously:
To check, and to increase the safety headroom, I moved 2 cables from the
"buggy" NIC to the "healthy" one - and kept the link config matching, of course.
And - lo and behold - we get up from ~2 GBit to > 3 GBit.
There are still thousands of rx_missed_errors
on the "bad" NIC, which only has to handle 2 GBit connections now, and
still zero rx_missed_errors on the "good" NIC, which carries 4 GBit
now.
Further googling and tweaking memory limits in
/proc/sys/net/ipv4/tcp_*mem
and
/proc/sys/net/core/*mem*
showed no difference.
What helped was to increase the "TCP window size" on the iperf server side
from
"TCP window size: 85.3 KByte (default)"
to a value between 512K and 2M
root@cruncher:/cluster/etc/network# iperf -s -w1M
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[ 4] local 192.168.130.254 port 5001 connected with 192.168.130.226 port
33775
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-10.0 sec 5.06 GBytes 4.35 Gbits/sec
Now we are over 70 % of theoretical maximum.
However, neither do I really understand it, nor do I know how to transfer
this window size setting to other applications.
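Some background that may help here: iperf's -w option sets the socket buffer size with setsockopt(), and Linux doubles the requested value for its own bookkeeping, which is why a request for 1 MByte is reported as 2 MByte. Applications that never call setsockopt() start from the middle value of net.ipv4.tcp_rmem (and tcp_wmem for sending); the commonly shipped default of 87380 bytes is the 85.3 KByte that iperf reports. A minimal sketch for raising that default system-wide follows. The helper name tcp_buf_commands and the byte values are illustrative assumptions, not measured optima, and whether this has the same effect as iperf -w on this gateway is untested.

```shell
#!/bin/sh
# Print (not run) sysctl commands that set the TCP buffer triples.
# Arguments are bytes: minimum, default, maximum.
# tcp_buf_commands is a hypothetical helper; review its output before use.
tcp_buf_commands() {
    echo "sysctl -w net.ipv4.tcp_rmem=\"$1 $2 $3\""
    echo "sysctl -w net.ipv4.tcp_wmem=\"$1 $2 $3\""
}

# 1 MiB default and 4 MiB ceiling, chosen only as an example.
tcp_buf_commands 4096 1048576 4194304
```

Since explicitly setting the buffer size also turns off the kernel's receive buffer autotuning for that socket, the two approaches need not behave identically.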
I think the TCP window size is just a workaround for underlying problems,
because
- still lots of rx_missed_errors for eth6 and eth7
- the blade-blade connection with 5.6 GBit works even better without any
tweaking with small TCP window size:
root@blade-001:~# iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[ 4] local 192.168.130.225 port 5001 connected with 192.168.130.226 port
49581
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-10.0 sec 6.49 GBytes 5.58 Gbits/sec
Possible causes on my list
- firmware problem (NICs, Mainboard)
- hardware problem (NICs, Mainboard)
- some really weird hidden tweak parameter
- conceptual limitation of hardware design
- driver problem
- kernel / scheduling issue / IRQ / race...whatever?
- still the nasty VC blade switch?
- any more?
The gateway mainboard is a SABERTOOTH 990FX R2.0
[AMD/ATI] RD890 PCI to PCI bridge (external gfx1 port A)
- consumer grade, but quite recent -
Gateway CPU is an AMD FX-8320 8 Core
Linux cruncher 3.19.0 #1 SMP Tue Mar 3 19:05:04 CET 2015 x86_64 GNU/Linux
The blade nodes are HP blades 460c G1
chipset Intel 5000
- enterprise grade, but quite some years now, I suppose -
CPU 2 x Xeon E5430 quad
Linux blade-002.crunchnet 3.16.0-0.bpo.4-amd64 #1 SMP Debian
3.16.7-ckt4-3~bpo70+1 (2015-02-12) x86_64 GNU/Linux
Testing memory bandwidth with mbw (as a first measure of system bus throughput),
the gateway outperforms the blades by a factor of two
root@blade-002:~# mbw -n1 1000
AVG Method: MEMCPY Elapsed: 0.61679 MiB: 1000.00000 Copy: 1621.300
MiB/s
AVG Method: DUMB Elapsed: 0.51892 MiB: 1000.00000 Copy: 1927.068
MiB/s
AVG Method: MCBLOCK Elapsed: 0.39211 MiB: 1000.00000 Copy: 2550.311
MiB/s
root@cruncher...# mbw -n1 1000
AVG Method: MEMCPY Elapsed: 0.27301 MiB: 1000.00000 Copy: 3662.923
MiB/s
AVG Method: DUMB Elapsed: 0.19693 MiB: 1000.00000 Copy: 5077.972
MiB/s
AVG Method: MCBLOCK Elapsed: 0.19287 MiB: 1000.00000 Copy: 5184.947
MiB/s
So, conceptually, I see no reason why, of two nearly identical quad-port Gbit
adapters, one should fail so badly on the faster system.
I compared the lspci output again line by line and found a tiny difference:
Hewlett-Packard Company NC364T.... (the 'bad')
Region 0: Memory at fc400000 (32-bit, non-prefetchable) [size=128K]
Region 1: Memory at fc300000 (32-bit, non-prefetchable) [size=512K]
Region 2: I/O ports at 8000 [size=32]
Intel Corporation PRO/1000 PT ...('the good')
Region 0: Memory at fc5a0000 (32-bit, non-prefetchable) [size=128K]
Region 1: Memory at fc580000 (32-bit, non-prefetchable) [size=128K]
Region 2: I/O ports at 5020 [size=32]
so the "Region 1" memory is 4x larger in the 'bad' NIC.
Any clue whether this may be related?
Just an uneducated guess:
If it were some kind of pointer fifo into some buffer memory, the larger one
might run out of referred buffer, while the smaller does not????
How to proceed from "Guess" to "Know" to "Cure"?
Anybody any idea?
===========
just to exclude the idiot's error, before hitting the send button:
I switched the cables to the faulty NIC (after now only two were left)
and rate on the teql link went down from > 2 Gbit to ~ 340 Kbits/sec
So, yes, cabling was right before,
and yes, the scheme provides some fault tolerance, albeit with severe hits in
performance.
Wolfgang Rosner