From: Stan Hoeppner <stan@hardwarefreak.com>
To: Adam Goryachev <mailinglists@websitemanagers.com.au>
Cc: Dave Cundiff <syshackmin@gmail.com>, linux-raid@vger.kernel.org
Subject: Re: RAID performance
Date: Mon, 11 Feb 2013 20:46:52 -0600
Message-ID: <5119AD1C.8030000@hardwarefreak.com>
In-Reply-To: <6990fbda-f741-454a-80cd-bdcdfd8c971c@email.android.com>
If it's OK I'm going to snip a bunch of this and get to the meat of it,
so hopefully it's less confusing.
On 2/10/2013 10:16 AM, Adam Goryachev wrote:
...
...
> The problem is that (from my understanding) LACP will balance the traffic based on the destination MAC address, by default. So the bandwidth between any two machines is limited to a single 1Gbps link. So regardless of the number of ethernet ports on the DC box, it will only ever use a max of 1Gbps to talk to the iSCSI server.
> However, if I configure Linux to use xmit_hash_policy=1 it will use the IP address and port (layer 3+4) to decide which trunk to use. It will still only use 1Gbps to talk to that IP:port combination.
That is correct. Long story short, the last time I messed with a
configuration such as this I was using a Cisco that fanned over 802.3ad
groups based on L3/4 info. Stock 802.3ad won't do this. I apologize
for the confusion, and for the delay in responding (twas a weekend after
all). I just finished reading the relevant section of your GS716T-200
(GST716-v2) manual, and it does not appear to have this capability.
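To illustrate why hashing on addresses or ports cannot speed up a single
iSCSI connection, here is a rough Python sketch. It is a simplified
model of hash-based link selection, not the kernel's or any switch's
exact formula. The function names, MAC values, and IP addresses are made
up for the example; 3260 is the standard iSCSI target port.

```python
import ipaddress

def layer2_link(src_mac: int, dst_mac: int, n_links: int) -> int:
    """Pick a link from the MAC pair only (simplified layer 2 style hash)."""
    return (src_mac ^ dst_mac) % n_links

def layer34_link(src_ip: str, dst_ip: str,
                 src_port: int, dst_port: int, n_links: int) -> int:
    """Pick a link from IPs and ports (simplified layer 3+4 style hash)."""
    ip_bits = int(ipaddress.ip_address(src_ip)) ^ int(ipaddress.ip_address(dst_ip))
    return ((src_port ^ dst_port) ^ (ip_bits & 0xFFFF)) % n_links

N_LINKS = 4
DC, SAN = "10.0.0.10", "10.0.0.20"   # hypothetical addresses

# Two hosts always present the same MAC pair, so layer 2 hashing
# pins everything between them to one link.
mac_choice = layer2_link(0x001B21AAAA01, 0x001B21BBBB01, N_LINKS)

# One TCP connection has one fixed 4-tuple, so it also stays on one link.
one_flow = {layer34_link(DC, SAN, 40000, 3260, N_LINKS) for _ in range(100)}

# Only several connections with different source ports can spread out.
many_flows = {layer34_link(DC, SAN, port, 3260, N_LINKS)
              for port in range(40000, 40004)}

print("layer 2 link for this host pair:", mac_choice)
print("links used by one connection:", len(one_flow))
print("links used by four connections:", len(many_flows))
```

The point is only that the choice is a pure function of the header
fields, so a single connection never exceeds one link's bandwidth.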
All is not lost. I've done a considerable amount of analysis of all the
information you've provided. In fact I've spent way too much time on
this. But it's an intriguing problem involving interesting systems
assembled from channel parts, i.e. "DIY", and I couldn't put it down. I
was hoping to come up with a long term solution that didn't require any
more hardware than a NIC and HBA, but that's just not really feasible.
So, my conclusions and recommendations, based on all the information I
have to date:
1. Channel bonding via a single switch using standard link aggregation
protocols cannot scale iSCSI throughput between two hosts. The
various Linux packet fanning modes don't work well here either for
scaling both transmit and receive traffic.
2. To scale iSCSI throughput using a single switch will require
multiple host ports and MPIO, but no LAG for these ports.
3. Given the facts above, an extra port could be added to each TS Xen
box. A separate subnet would be created for the iSCSI SAN traffic,
and each port given an IP in the subnet. Both ports would carry
MPIO iSCSI packets, but only one port would carry user traffic.
4. Given the fact that there will almost certainly be TS users on the
target box when the DC VM gets migrated due to some kind of failure
or maintenance, adding the load of file sharing may not prove
desirable. And you'd need another switch. Thus, I'd recommend:
A. Dedicate the DC Xen box as a file server and dedicate a non-TS
Xen box as its failover partner. Each machine will receive a quad
port NIC. Two ports on each host will be connected to the current
16 port switch. The two ports will be configured to balance-alb
using the current user network IP address. All switch ports will
be reconfigured to standard mode, no LAGs, as they are not needed
for Linux balance-alb. Disconnect the 8111 mobo ports on these two
boxes from the switch as they're no longer needed. Prioritize RDP
in the switch, leave all other protocols alone.
B. We remove 4 links each from the iSCSI servers, the primary and the
DRBD backup server, from the switch. This frees up 8 ports for
connecting the file servers' 4 ports, and connecting a motherboard
ethernet port from each iSCSI server to the switch for management.
If my math is correct this should leave two ports free.
C. MPIO is designed specifically for IO scaling, and works well.
So it's a better fit, and you save the cost of the additional
switch(es) that would be required to do perfect balance-rr bonding
between iSCSI hosts (which can be done easily with each host
ethernet port connected to a different dedicated SAN switch). In
this case it would require 4 additional switches. Instead what
we'll do here is connect the remaining 2 ports from each Xen file
server box, the primary and the backup, and all 4 ports on each
iSCSI server, the primary and the backup, to a new 12-16 port
switch. It can be any cheap unmanaged GbE switch of 12 or more
ports. We'll assign an IP address in the new SAN subnet to each
physical port on these 4 boxes and configure MPIO accordingly.
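The port math in item B can be checked with a few lines of Python. This
is only a sketch of the accounting as stated in item B. It does not
count the 8111 motherboard ports disconnected in item A, and the
variable names are invented for the example.

```python
# Existing 16 port switch, accounting from item B only.
ISCSI_SERVERS = 2            # primary and DRBD backup
LINKS_REMOVED_PER_ISCSI = 4  # links pulled from each iSCSI server
FILE_SERVERS = 2             # DC Xen box and its failover partner
USER_PORTS_PER_FILE_SERVER = 2   # balance-alb pair on the user network
MGMT_PORTS_PER_ISCSI = 1     # one motherboard port each for management

freed = ISCSI_SERVERS * LINKS_REMOVED_PER_ISCSI
file_server_links = FILE_SERVERS * USER_PORTS_PER_FILE_SERVER
mgmt_links = ISCSI_SERVERS * MGMT_PORTS_PER_ISCSI
spare = freed - file_server_links - mgmt_links

print(f"freed={freed} file_server_links={file_server_links} "
      f"mgmt_links={mgmt_links} spare={spare}")
```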
So what we end up with is decent session based scaling of user CIFS
traffic between the TS hosts and the DC Xen servers, with no single
TS host bogging everyone down, and no desktop lag if both links are
full due to two greedy users. We end up with nearly perfect
~200MB/s iSCSI scaling in both directions between the DC Xen box
(and/or backup) and the iSCSI servers, and we end up with nearly
perfect ~400MB/s each way between the two iSCSI servers via DRBD,
allowing you to easily do mirroring in real-time.
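The ~200MB/s and ~400MB/s figures follow from simple per-link
arithmetic, sketched below in Python. The 100 MB/s usable rate per GbE
link is an assumption inferred from the figures above (the raw line
rate of 1Gbps is 125 MB/s before protocol overhead), and the function
name is made up for the example.

```python
GBE_LINE_RATE_MB_S = 1000 / 8   # 1 Gbps expressed in MB/s, before overhead
USABLE_PER_LINK_MB_S = 100      # assumed usable iSCSI payload rate per link

def mpio_estimate_mb_s(paths: int) -> int:
    """Rough aggregate throughput if MPIO spreads IO evenly over all paths."""
    return paths * USABLE_PER_LINK_MB_S

xen_to_iscsi = mpio_estimate_mb_s(2)    # 2 SAN ports on each Xen file server
iscsi_to_iscsi = mpio_estimate_mb_s(4)  # 4 SAN ports on each iSCSI server

print("line rate per link (MB/s):", GBE_LINE_RATE_MB_S)
print("Xen <-> iSCSI estimate (MB/s):", xen_to_iscsi)
print("iSCSI <-> iSCSI (DRBD) estimate (MB/s):", iscsi_to_iscsi)
```

Real numbers depend on the storage, the MPIO path policy, and the
workload, so treat these as upper-end estimates.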
All for the cost of two quad port NICs and an inexpensive switch, and
possibly a new high performance SAS HBA. I analyzed many possible paths
to a solution, and I think this one is probably close to ideal.
You can pull off the same basic concept buying just the quad port NIC
for the current DC Xen box, removing 2 links between each iSCSI server
and the switch and direct connecting these 4 NIC ports via 2 crossover
cables, and using yet another IP subnet for these, with MPIO. You'd
have no failover for the DC, and the bandwidth between the iSCSI servers
for DRBD would be cut in half. But it only costs one quad port NIC. A
dedicated 200MB/s is probably more than plenty for live DRBD, but again
you have no DC failover.
However, given that you've designed this system with "redundancy
everywhere" in mind, I'm guessing the additional redundancy justifies
the capital outlay for an unmanaged switch and a 2nd quad port NIC.
<BIG snip>
> So, given the above, would you still suggest only adding a 4port ethernet to the DC box configured with LACP, or should I really look at something else.
I think LACP is out, regardless of transmit hash mode.
If one of those test boxes could be permanently deployed as the failover
host for the DC VM, I think the dedicated iSCSI switch architecture
makes the most sense long term. If the cost of the switch and another 4
port NIC isn't in the cards right now, you can go the other route with
just one new NIC. And given that you'll be doing no ethernet channel
bonding on the iSCSI network, but IP based MPIO instead, it's a snap to
convert to the redundant architecture with new switch later. All you'll
be doing is swapping cables to the new switch and changing IP address
bindings on the NICs as needed.
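As a sketch of the per-port addressing this relies on (one IP in the SAN
subnet per physical port, no bonding), something like the following
Python could generate an address plan. The subnet 192.168.50.0/24 and
the host names are purely hypothetical placeholders, not values from
this thread.

```python
import ipaddress

# Hypothetical SAN subnet and host names; substitute your own.
SAN_SUBNET = ipaddress.ip_network("192.168.50.0/24")
SAN_PORTS_PER_HOST = {
    "xen-fs-primary": 2,
    "xen-fs-backup": 2,
    "iscsi-primary": 4,
    "iscsi-backup": 4,
}

def plan_addresses(subnet, ports_per_host):
    """Assign one unique address per physical SAN port, in host order."""
    free = subnet.hosts()   # iterator over usable host addresses
    plan = {}
    for host, count in ports_per_host.items():
        for port_index in range(count):
            plan[(host, port_index)] = next(free)
    return plan

san_plan = plan_addresses(SAN_SUBNET, SAN_PORTS_PER_HOST)
for (host, port_index), addr in san_plan.items():
    print(f"{host} port {port_index}: {addr}")
```

The plan comes to 12 addresses, one per switch port used, which is why
a switch of 12 or more ports is enough. Moving from the crossover
layout to the dedicated switch later only means regenerating the plan
and rebinding the addresses.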
Again, apologies for the false start with the 802.3ad confusion on my
part. I think you'll find all (or at least most) of the ducks in a row
in the recommendations above.
--
Stan