From: Wolfgang Rosner <wrosner@tirnet.de>
To: lartc@vger.kernel.org
Subject: Re: TEQL for bonding Multi Gbit Ethernet in a cluster?
Date: Sat, 14 Mar 2015 17:44:03 +0000
Message-ID: <201503141844.04588.wrosner@tirnet.de>
In-Reply-To: <201503132226.39582.wrosner@tirnet.de>
Hello, Jay,
thanks for your prompt answer.
> You may also be able to tweak some interface parameters and
> improve things; I'll point you at this discussion from a few years ago:
>
> http://lists.openwall.net/netdev/2011/08/25/88
OK.
I tried to tweak rx-usecs as described there, but saw no reproducible
difference. My system's default was 18, and I tried both 6 and 45.
Regarding TSO and the other offloads, I think these are already enabled by
default on recent systems:
root@blade-001:~# ethtool -k eth0 | grep offload
tcp-segmentation-offload: on
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: on
l2-fwd-offload: off [fixed]
However, following the hints in this article, I came across the most obvious
way to improve throughput:
jumbo frames.
Setting mtu=9000 on both sides, I get
5200 MBit/s netperf throughput, which is 86 % of the theoretical maximum
(was 4100 MBit/s with mtu=1500 before).
NFS transfer is at 3.4 GBit/s (was 2.7 GBit/s with mtu=1500).
I once saw 4.2 GBit/s, but cannot reproduce it.
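As a rough cross-check of the percentage above, here is a minimal sketch. It assumes the "theoretical maximum" is simply six 1 GBit/s links per blade (eth0 ... eth5) added up, ignoring protocol overhead; the names LINKS, LINK_RATE_MBIT and efficiency are only illustrative.

```python
# Assumption: six 1 GBit/s links per blade (eth0 ... eth5) in the balance-rr bond,
# and "theoretical maximum" meaning the plain sum of the link rates.
LINKS = 6
LINK_RATE_MBIT = 1000


def efficiency(measured_mbit, links=LINKS, link_rate_mbit=LINK_RATE_MBIT):
    """Return measured throughput as a fraction of the aggregate line rate."""
    return measured_mbit / (links * link_rate_mbit)


# netperf figures quoted in the text, in MBit/s
for label, mbit in [("mtu=1500", 4100), ("mtu=9000", 5200)]:
    print(f"{label}: {mbit} MBit/s is {efficiency(mbit):.1%} of the aggregate rate")
```

Under that assumption the mtu=9000 figure comes out between 86 % and 87 %, in line with the number given above.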
NFS options for the cross-mounted /run/shm ramdisks are shown by mount as
192.168.130.2:/shm on /cluster/shm/node002 type nfs4
(rw,noatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,soft,proto=tcp,port=0,timeo=600,retrans=1,sec=sys,clientaddr=192.168.130.3,local_lock=none,addr=192.168.130.2)
What I have configured in the automounter script:
$nfs_opts
= "-fstype=nfs4,sec=sys,async,noatime,fg,soft,intr,retrans=1,retry=0" ;
So I haven't configured rsize/wsize.
According to the manual, client and server agree on the highest value both
support, and try to get to 1 MByte.
Anyway, I'm getting off topic, as this is not an NFS mailing list.
> % netstat -s|grep -i reord
> Detected reordering 20 times using time stamp
>
> or you can hunt for the raw values in /proc/net/netstat or use
> nstat to print them:
Hm. I see figures, but how do I interpret them?
before:
root@blade-002:~# netstat -s|grep -i reord
Detected reordering 1 times using FACK
root@blade-003:~# netstat -s|grep -i reord
Detected reordering 1 times using FACK
Detected reordering 1 times using SACK
now doing some work:
copying a 4 GB file over nfs between ram disks:
(from blade-003 to blade-002)
root@blade-002:~# time cp /cluster/shm/node003/random.002 /run/shm/random.002
real 0m8.701s
user 0m0.000s
sys 0m4.816s
after:
root@blade-002:~# netstat -s|grep -i reord
Detected reordering 1 times using FACK
root@blade-003:~# netstat -s|grep -i reord
Detected reordering 2 times using FACK
Detected reordering 234 times using SACK
Detected reordering 7 times using time stamp
Shouldn't I expect the reordering problems to show up on the receiver's side?
But I see them on the sender. I double- and triple-checked this.
Just in case you have an eye for peculiarities I do not see:
sender side
root@blade-003:~# nstat
#kernel
IpInReceives 721022 0.0
IpInDelivers 721022 0.0
IpOutRequests 550631 0.0
TcpActiveOpens 1 0.0
TcpPassiveOpens 1 0.0
TcpInSegs 720990 0.0
TcpOutSegs 2177539 0.0
TcpRetransSegs 4566 0.0
UdpInDatagrams 32 0.0
UdpOutDatagrams 2 0.0
TcpExtDelayedACKs 33 0.0
TcpExtTCPPrequeued 1 0.0
TcpExtTCPHPHits 2066 0.0
TcpExtTCPPureAcks 623665 0.0
TcpExtTCPHPAcks 38636 0.0
TcpExtTCPSackRecovery 423 0.0
TcpExtTCPFACKReorder 1 0.0
TcpExtTCPSACKReorder 233 0.0
TcpExtTCPTSReorder 7 0.0
TcpExtTCPFullUndo 19 0.0
TcpExtTCPPartialUndo 18 0.0
TcpExtTCPDSACKUndo 336 0.0
TcpExtTCPFastRetrans 1642 0.0
TcpExtTCPForwardRetrans 2924 0.0
TcpExtTCPDSACKRecv 3942 0.0
TcpExtTCPDSACKOfoRecv 6 0.0
TcpExtTCPDSACKIgnoredOld 15 0.0
TcpExtTCPDSACKIgnoredNoUndo 177 0.0
TcpExtTCPSackShifted 62709 0.0
TcpExtTCPSackMerged 261712 0.0
TcpExtTCPSackShiftFallback 404717 0.0
TcpExtTCPRetransFail 43 0.0
TcpExtTCPRcvCoalesce 536 0.0
TcpExtTCPOFOQueue 3 0.0
TcpExtTCPSpuriousRtxHostQueues 605 0.0
TcpExtTCPAutoCorking 58763 0.0
TcpExtTCPOrigDataSent 2176191 0.0
IpExtInBcastPkts 30 0.0
IpExtInOctets 53708303 0.0
IpExtOutOctets 3181655278 0.0
IpExtInBcastOctets 2280 0.0
IpExtInNoECTPkts 721719 0.0
receiver side:
root@blade-002:~# nstat
#kernel
IpInReceives 750213 0.0
IpInAddrErrors 2 0.0
IpInDelivers 750211 0.0
IpOutRequests 751510 0.0
IcmpInErrors 246 0.0
IcmpInCsumErrors 112 0.0
IcmpInTimeExcds 224 0.0
IcmpInEchoReps 3 0.0
IcmpInTimestamps 19 0.0
IcmpOutErrors 246 0.0
IcmpOutTimeExcds 224 0.0
IcmpOutEchoReps 19 0.0
IcmpOutTimestamps 3 0.0
IcmpMsgInType0 19 0.0
IcmpMsgInType3 224 0.0
IcmpMsgInType8 3 0.0
IcmpMsgOutType0 3 0.0
IcmpMsgOutType3 224 0.0
IcmpMsgOutType8 19 0.0
TcpActiveOpens 118 0.0
TcpPassiveOpens 10 0.0
TcpAttemptFails 112 0.0
TcpInSegs 748966 0.0
TcpOutSegs 751036 0.0
TcpRetransSegs 129 0.0
TcpOutRsts 2 0.0
UdpInDatagrams 871 0.0
UdpOutDatagrams 289 0.0
Ip6OutRequests 10 0.0
Ip6OutMcastPkts 16 0.0
Ip6OutOctets 688 0.0
Ip6OutMcastOctets 1144 0.0
Icmp6OutMsgs 10 0.0
Icmp6OutRouterSolicits 3 0.0
Icmp6OutNeighborSolicits 1 0.0
Icmp6OutMLDv2Reports 6 0.0
Icmp6OutType133 3 0.0
Icmp6OutType135 1 0.0
Icmp6OutType143 6 0.0
TcpExtPruneCalled 3 0.0
TcpExtTW 3 0.0
TcpExtDelayedACKs 372 0.0
TcpExtDelayedACKLocked 2 0.0
TcpExtDelayedACKLost 3926 0.0
TcpExtTCPPrequeued 2 0.0
TcpExtTCPHPHits 42851 0.0
TcpExtTCPPureAcks 1056 0.0
TcpExtTCPHPAcks 10889 0.0
TcpExtTCPSackRecovery 4 0.0
TcpExtTCPFACKReorder 1 0.0
TcpExtTCPDSACKUndo 2 0.0
TcpExtTCPFastRetrans 11 0.0
TcpExtTCPForwardRetrans 3 0.0
TcpExtTCPTimeouts 113 0.0
TcpExtTCPLossProbes 2 0.0
TcpExtTCPLossProbeRecovery 1 0.0
TcpExtTCPRcvCollapsed 543 0.0
TcpExtTCPDSACKOldSent 3951 0.0
TcpExtTCPDSACKOfoSent 6 0.0
TcpExtTCPDSACKRecv 13 0.0
TcpExtTCPDSACKIgnoredNoUndo 1 0.0
TcpExtTCPSackShifted 7 0.0
TcpExtTCPSackMerged 23 0.0
TcpExtTCPSackShiftFallback 72 0.0
TcpExtTCPBacklogDrop 204 0.0
TcpExtTCPRcvCoalesce 44267 0.0
TcpExtTCPOFOQueue 487750 0.0
TcpExtTCPOFOMerge 6 0.0
TcpExtTCPSpuriousRtxHostQueues 112 0.0
TcpExtTCPAutoCorking 1424 0.0
TcpExtTCPWantZeroWindowAdv 45 0.0
TcpExtTCPSynRetrans 112 0.0
TcpExtTCPOrigDataSent 16466 0.0
IpExtInBcastPkts 710 0.0
IpExtInOctets 3252256888 0.0
IpExtOutOctets 57783538 0.0
IpExtInBcastOctets 75470 0.0
IpExtInNoECTPkts 2236348 0.0
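To put a few of these counters into proportion, here is a small sketch that parses nstat-style lines and computes two ratios. The helper parse_nstat and the choice of ratios are my own illustration, not an established diagnostic; the sample lines are copied from the two listings above. Note that the counters cover everything since the previous nstat call, not only the file copy.

```python
def parse_nstat(text):
    """Parse 'Name Value Rate' lines as printed by nstat into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the '#kernel' header and anything that is not a counter line.
        if line.startswith("#") or len(fields) < 2:
            continue
        counters[fields[0]] = int(fields[1])
    return counters


# Excerpts from the sender (blade-003) and receiver (blade-002) listings.
SENDER = """\
#kernel
TcpOutSegs 2177539 0.0
TcpRetransSegs 4566 0.0
TcpExtTCPSACKReorder 233 0.0
"""
RECEIVER = """\
#kernel
TcpInSegs 748966 0.0
TcpExtTCPOFOQueue 487750 0.0
"""

sender = parse_nstat(SENDER)
receiver = parse_nstat(RECEIVER)

# Share of sent segments that were retransmissions.
retrans_ratio = sender["TcpRetransSegs"] / sender["TcpOutSegs"]
# Out-of-order queue events relative to received segments.
ofo_ratio = receiver["TcpExtTCPOFOQueue"] / receiver["TcpInSegs"]

print(f"sender retransmit ratio:    {retrans_ratio:.2%}")
print(f"receiver OFO queue / InSegs: {ofo_ratio:.2%}")
```

Read this way, retransmissions are a small fraction of the sender's segments, while TcpExtTCPOFOQueue on the receiver is large compared to TcpInSegs, which may be where the receiver-side view of the reordering shows up.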
Anyway, I could live with these figures, which I get between bonding
interfaces configured with balance-rr bonding.
However, when I switch over to the gateway, which is connected by an 802.3ad
bonding policy link, performance sucks:
from rr to 802.3ad
root@cruncher:/cluster/etc/scripts/available# time
cp /cluster/shm/node003/random.002 /run/shm/
real 1m20.708s
user 0m0.000s
sys 0m5.812s
=> 37 MByte/s = 300 MBit/s
from 802.3ad to rr
root@blade-002:~# time cp /cluster/shm/node000/random.002 /run/shm/random.002
real 0m26.747s
user 0m0.008s
sys 0m4.256s
=> 111 MByte/s = 888 MBit/s
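The conversion behind these two figures is just a factor of 8, and the result is in MBit/s. A minimal sketch follows; the comparison against a single 1 GBit/s link is my assumption about what limits the transfer, and the function names are illustrative.

```python
SINGLE_LINK_MBIT = 1000  # assumption: one member link of the bond runs at 1 GBit/s


def mbyte_to_mbit(mbyte_per_s):
    """Convert MByte/s to MBit/s (8 bits per byte)."""
    return mbyte_per_s * 8


def share_of_single_link(mbyte_per_s, link_mbit=SINGLE_LINK_MBIT):
    """Return the rate as a fraction of one link's nominal capacity."""
    return mbyte_to_mbit(mbyte_per_s) / link_mbit


# Rates taken from the two cp timings above.
for label, rate in [("rr -> 802.3ad", 37), ("802.3ad -> rr", 111)]:
    print(f"{label}: {rate} MByte/s = {mbyte_to_mbit(rate)} MBit/s, "
          f"{share_of_single_link(rate):.0%} of one link")
```

So even the better direction stays below the capacity of a single GBit link.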
> >I tried layer 2 bonding as described here
(... searching for a )
>> all-linux, maybe layer 3 alternative,
So maybe I'd leave the rr in place for peer-to-peer connections between the
blades and just have a layer-3, teql-like thing to the gateway?
Hm, but can this work?
balance-rr bonding sets the same MAC address on all bond slaves. So I'm afraid
there is no longer a chance to mix it with assigning individual IPs to the
slave interfaces, right?
But when all interfaces have the same MAC, distribution is left to the switch,
with all the opacity problems I encountered.
So either I go all-layer-2 or all-layer-3, right?
> That text in the bonding documentation is fairly old, and
(...)
> It doesn't work well today, if for no other reason than
> interrupt coalescing and NAPI on the receiver will induce serious out of
> order delivery, and turning that off is not really an option.
Well, as my figures above tell me, it's not that bad, as long as it can be
configured undisturbed on both sides and matches the switch topology.
> >- How does the routing look like if I have 17 hosts connected by 6
> > interfaces each?
As long as this question is not worked out, I have no chance to test teql on
my system, I'm afraid.
> >Current best setting is now having the blades on balancing-rr and the
> > gateway connected by 8 parallel Gbit-links to one single VC-device and
> > using LACP / 802.3ad on this.
>
> If you're testing your single stream throughput through this
> LACP aggregation, you'll be limited by the throughput of one member of
> that aggregation, as LACP will not stripe traffic.
I know.
That's the reason why I would like to do round robin.
What can I expect from teql as compared to rr-bonding and to LACP-bonding?
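To illustrate the difference between hash-based distribution (as with LACP) and round robin for a single stream, here is a toy model. It is only a sketch: hash_policy is a hypothetical stand-in using CRC32, not the bonding driver's actual transmit hash, and the flow tuple and packet count are made up for the example.

```python
from collections import Counter
from itertools import cycle
import zlib

LINKS = 8  # the gateway is attached with 8 parallel GBit links


def hash_policy(flow, n_links):
    """Hypothetical transmit hash: every packet of one flow maps to one link."""
    return zlib.crc32(repr(flow).encode()) % n_links


def send_hashed(flow, packets, n_links=LINKS):
    """Count packets per link when the link is chosen by hashing the flow."""
    return Counter(hash_policy(flow, n_links) for _ in range(packets))


def send_round_robin(packets, n_links=LINKS):
    """Count packets per link when packets are striped link by link."""
    links = cycle(range(n_links))
    return Counter(next(links) for _ in range(packets))


# One single TCP stream (example values): source, destination, port.
flow = ("192.168.130.3", "192.168.130.2", 2049)
hashed = send_hashed(flow, 8000)
striped = send_round_robin(8000)

print("links used with hashing:    ", len(hashed))
print("links used with round robin:", len(striped))
```

In this model a single flow always ends up on exactly one link with hashing, while round robin spreads the same packets evenly over all eight, at the price of possible reordering.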
> Another issue is that, even if you round-robin from the host's
> bond, if traffic has to transit through a switch aggregation (channel
> group), it will rebalance the traffic on egress, and most likely funnel
> it all back through a single switch port.
That's obviously what happens in
blade <-> gateway connections
due to the asymmetric connection.
In blade <-> blade peering, it works fine, as I wrote.
I'll try to draw an ASCII picture of the topology:
+-+-+-+-+-+----blade-001
+-+-+-+-+-+----blade-002
+-+-+-+-+-+----blade-003
+-+-+-+-+-+----blade-004
+-+-+-+-+-+----blade-005
+-+-+-+-+-+----blade-006
+-+-+-+-+-+----blade-007
+-+-+-+-+-+----blade-008
+-+-+-+-+-+----blade-009
+-+-+-+-+-+----blade-010
+-+-+-+-+-+----blade-011
+-+-+-+-+-+----blade-012
+-+-+-+-+-+----blade-013
+-+-+-+-+-+----blade-014
+-+-+-+-+-+----blade-015
+-+-+-+-+-+----blade-016
+-------------eth2---gateway(aka cruncher)
+-------------eth3---gateway(aka cruncher)
+-------------eth4---gateway(aka cruncher)
+-------------eth5---gateway(aka cruncher)
+-------------eth6---gateway(aka cruncher)
+-------------eth7---gateway(aka cruncher)
+-------------eth8---gateway(aka cruncher)
+-------------eth9---gateway(aka cruncher)
Each + column is a VC-switching module
Blades have eth0 ... eth5 connected in the hardwired matrix shown above.
There are additional stacking links between the VC-switching modules not shown
here.
But it looks like the shortest path algorithm keeps rr neatly ordered between
blades.
But when I distribute the gateway connections equally across all switch
modules, only one of them is "link-active"; the others are shown as
"link-failover".
Only when I connect all of them to a single VC and configure them using LACP
are they used in parallel. But that does not match the round-robin mode, just
as you mention.
> one-switch-per-interface sort of arrangement that blade environments
> impose, and never really got bonding to work well for load balancing in
> those type of environments.
>
> One issue for production use was that if a switch port fails on
> one of the switches, the other peers sending traffic into that switch
Well, I think there are different goals in High-PERFORMANCE clustering as
opposed to High-AVAILABILITY clustering.
Most "production use" refers to web server or enterprise system stuff,
which is basically HA, I'd say. And that's what those boxes are optimised
for - see the link-failover issue above.
Setting up a new HPC cluster with a bunch of dollar notes, I presumably would
simply go for InfiniBand instead of Ethernet (or at least 10 GBit Ethernet),
but there's no budget for that. I simply try to get the best out of the stuff
that I can pick up at the lower end of the food chain ;-)
Hm, so what now?
I'll read through the HP documentation to see whether I can get rid of the
failover behaviour. If only I could just rip off all the stacking links and let
the VC modules each behave as a "good old cheap and silly" switch....
Wolfgang Rosner