Re: RAID performance - Stan Hoeppner

From: Stan Hoeppner <stan@hardwarefreak.com>
To: Adam Goryachev <mailinglists@websitemanagers.com.au>
Cc: Dave Cundiff <syshackmin@gmail.com>, linux-raid@vger.kernel.org
Subject: Re: RAID performance
Date: Mon, 11 Feb 2013 20:46:52 -0600
Message-ID: <5119AD1C.8030000@hardwarefreak.com>
In-Reply-To: <6990fbda-f741-454a-80cd-bdcdfd8c971c@email.android.com>

If it's OK I'm going to snip a bunch of this and get to the meat of it,
so hopefully it's less confusing.

On 2/10/2013 10:16 AM, Adam Goryachev wrote:
...
...
> The problem is that (from my understanding) LACP will balance the traffic based on the destination MAC address, by default. So the bandwidth between any two machines is limited to a single 1Gbps link. So regardless of the number of ethernet ports on the DC box, it will only ever use a max of 1Gbps to talk to the iSCSI server.

> However, if I configure Linux to use xmit_hash_policy=1 it will use the IP address and port (layer 3+4) to decide which trunk to use. It will still only use 1Gbps to talk to that IP:port combination.

That is correct.  Long story short, the last time I messed with a
configuration such as this I was using a Cisco that fanned over 802.3ad
groups based on L3/4 info.  Stock 802.3ad won't do this.  I apologize
for the confusion, and for the delay in responding (twas a weekend after
all).  I just finished reading the relevant section of your GS716T-200
(GST716-v2) manual, and it does not appear to have this capability.
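
To make the point above concrete, here is a rough Python sketch of how a
hash-based transmit policy picks a link.  The layer3+4 formula follows
the one given in the Linux bonding documentation of this era (newer
kernels may hash differently), the layer2 variant is simplified, and the
addresses, ports and 4-link count are made up for illustration.

```python
import ipaddress


def layer2_hash(src_mac: str, dst_mac: str, n_links: int) -> int:
    """Simplified layer2 policy: XOR of the two MACs, modulo link count."""
    src = int(src_mac.replace(":", ""), 16)
    dst = int(dst_mac.replace(":", ""), 16)
    return (src ^ dst) % n_links


def layer3_4_hash(src_ip: str, dst_ip: str,
                  src_port: int, dst_port: int, n_links: int) -> int:
    """layer3+4 policy as documented for older kernels:
    ((sport XOR dport) XOR ((sip XOR dip) AND 0xffff)) modulo link count.
    """
    sip = int(ipaddress.IPv4Address(src_ip))
    dip = int(ipaddress.IPv4Address(dst_ip))
    return ((src_port ^ dst_port) ^ ((sip ^ dip) & 0xFFFF)) % n_links


# Hypothetical example values, not taken from the thread.
N_LINKS = 4
XEN_IP, ISCSI_IP, ISCSI_PORT = "10.0.0.10", "10.0.0.20", 3260

# One iSCSI session is one TCP flow: every packet hashes to the same link.
links_used = {layer3_4_hash(XEN_IP, ISCSI_IP, 51234, ISCSI_PORT, N_LINKS)
              for _ in range(1000)}
print("links used by a single session:", len(links_used))

# Several sessions with different source ports can land on different links.
spread = {layer3_4_hash(XEN_IP, ISCSI_IP, port, ISCSI_PORT, N_LINKS)
          for port in range(51234, 51238)}
print("links used by four sessions:", len(spread))
```

Whatever the hash inputs, a single IP:port pair always maps to one
link, which is why one iSCSI session tops out at 1Gbps no matter how
many ports sit in the group.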

All is not lost.  I've done a considerable amount of analysis of all the
information you've provided.  In fact I've spent way too much time on
this.  But it's an intriguing problem involving interesting systems
assembled from channel parts, i.e. "DIY", and I couldn't put it down.  I
was hoping to come up with a long term solution that didn't require any
more hardware than a NIC and HBA, but that's just not really feasible.
So, my conclusions and recommendations, based on all the information I
have to date:

1.  Channel bonding via a single switch using standard link aggregation
    protocols cannot scale iSCSI throughput between two hosts.  The
    various Linux packet fanning modes don't work well here either for
    scaling both transmit and receive traffic.

2.  To scale iSCSI throughput using a single switch will require
    multiple host ports and MPIO, but no LAG for these ports.

3.  Given the facts above, an extra port could be added to each TS Xen
    box.  A separate subnet would be created for the iSCSI SAN traffic,
    and each port given an IP in the subnet.  Both ports would carry
    MPIO iSCSI packets, but only one port would carry user traffic.

4.  Given the fact that there will almost certainly be TS users on the
    target box when the DC VM gets migrated due to some kind of failure
    or maintenance, adding the load of file sharing may not prove
    desirable.  And you'd need another switch.  Thus, I'd recommend:

A.  Dedicate the DC Xen box as a file server and dedicate a non-TS
    Xen box as its failover partner.  Each machine will receive a quad
    port NIC.  Two ports on each host will be connected to the current
    16 port switch.  The two ports will be configured to balance-alb
    using the current user network IP address.  All switch ports will
    be reconfigured to standard mode, no LAGs, as they are not needed
    for Linux balance-alb.  Disconnect the 8111 mobo ports on these two
    boxes from the switch as they're no longer needed.  Prioritize RDP
    in the switch, leave all other protocols alone.

B.  We remove 4 links each from the iSCSI servers, the primary and the
    DRBD backup server, from the switch.  This frees up 8 ports for
    connecting the file servers' 4 ports, and connecting a motherboard
    ethernet port from each iSCSI server to the switch for management.
    If my math is correct this should leave two ports free.

C.  MPIO is designed specifically for IO scaling, and works well.
    So it's a better fit, and you save the cost of the additional
    switch(es) that would be required to do perfect balance-rr bonding
    between iSCSI hosts (which can be done easily with each host
    ethernet port connected to a different dedicated SAN switch).  In
    this case it would require 4 additional switches.  Instead what
    we'll do here is connect the remaining 2 ports from each Xen file
    server box, the primary and the backup, and all 4 ports on each
    iSCSI server, the primary and the backup, to a new 12-16 port
    switch.  It can be any cheap unmanaged GbE switch of 12 or more
    ports.  We'll assign an IP address in the new SAN subnet to each
    physical port on these 4 boxes and configure MPIO accordingly.

    So what we end up with is decent session based scaling of user CIFS
    traffic between the TS hosts and the DC Xen servers, with no single
    TS host bogging everyone down, and no desktop lag if both links are
    full due to two greedy users.  We end up with nearly perfect
    ~200MB/s iSCSI scaling in both directions between the DC Xen box
    (and/or backup) and the iSCSI servers, and we end up with nearly
    perfect ~400MB/s each way between the two iSCSI servers via DRBD,
    allowing you to easily do mirroring in real-time.
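
The port counts in steps A through C can be checked with a few lines of
Python.  This is only a sketch of the arithmetic: the function and
parameter names are invented here, and the counts are the ones stated
above (2 file servers, 2 iSCSI servers, 4 links pulled from each iSCSI
server, one management port each).

```python
def port_budget(iscsi_servers: int = 2, links_removed_each: int = 4,
                file_servers: int = 2, user_ports_each: int = 2,
                mgmt_ports_each: int = 1) -> tuple:
    """Ports freed on the existing switch, ports reused, ports left over."""
    freed = iscsi_servers * links_removed_each
    reused = (file_servers * user_ports_each
              + iscsi_servers * mgmt_ports_each)
    return freed, reused, freed - reused


def san_switch_ports(file_servers: int = 2, san_ports_per_file_server: int = 2,
                     iscsi_servers: int = 2, san_ports_per_iscsi_server: int = 4) -> int:
    """Ports needed on the new dedicated SAN switch."""
    return (file_servers * san_ports_per_file_server
            + iscsi_servers * san_ports_per_iscsi_server)


freed, reused, left_over = port_budget()
print(f"existing switch: {freed} freed, {reused} reused, {left_over} left over")
print(f"new SAN switch needs {san_switch_ports()} ports")
```

With the defaults this gives 8 ports freed, 6 reused and 2 left over
on the existing switch, and 12 ports on the SAN switch, which matches
the "12 or more ports" sizing.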

All for the cost of two quad port NICs and an inexpensive switch, and
possibly a new high performance SAS HBA.  I analyzed many possible paths
to a solution, and I think this one is probably close to ideal.

You can pull off the same basic concept buying just the quad port NIC
for the current DC Xen box, removing 2 links between each iSCSI server
and the switch and direct connecting these 4 NIC ports via 2 crossover
cables, and using yet another IP subnet for these, with MPIO.  You'd
have no failover for the DC, and the bandwidth between the iSCSI servers
for DRBD would be cut in half.  But it only costs one quad port NIC.  A
dedicated 200MB/s is probably more than plenty for live DRBD, but again
you have no DC failover.
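
As a rough check on the bandwidth figures, the sketch below compares
the DRBD link capacity of the two options.  The ~100MB/s of usable
payload per GbE link is an assumption backed out of the ~200MB/s and
~400MB/s figures quoted here, not a measurement, and the names are made
up for the example.

```python
PER_LINK_MB_S = 100  # assumed usable payload per GbE link, in MB/s


def aggregate_mb_s(links: int, per_link: int = PER_LINK_MB_S) -> int:
    """Ideal aggregate throughput if MPIO spreads I/O evenly over all links."""
    return links * per_link


# Links available for DRBD between the two iSCSI servers in each option.
drbd_links = {
    "dedicated SAN switch": 4,  # all 4 ports of each iSCSI server
    "crossover cables": 2,      # 2 direct links between the servers
}

for option, links in drbd_links.items():
    print(f"{option}: {links} links, ~{aggregate_mb_s(links)}MB/s for DRBD")
```

This is a best-case figure.  Real throughput depends on how evenly
the multipath layer distributes I/O across the paths.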

However, given that you've designed this system with "redundancy
everywhere" in mind, I'm guessing the additional redundancy justifies
the capital outlay for an unmanaged switch and a 2nd quad port NIC.

<BIG snip>

> So, given the above, would you still suggest only adding a 4 port ethernet to the DC box configured with LACP, or should I really look at something else?

I think LACP is out, regardless of transmit hash mode.

If one of those test boxes could be permanently deployed as the failover
host for the DC VM, I think the dedicated iSCSI switch architecture
makes the most sense long term.  If the cost of the switch and another 4
port NIC isn't in the cards right now, you can go the other route with
just one new NIC.  And given that you'll be doing no ethernet channel
bonding on the iSCSI network, but IP based MPIO instead, it's a snap to
convert to the redundant architecture with new switch later.  All you'll
be doing is swapping cables to the new switch and changing IP address
bindings on the NICs as needed.
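
Since each physical SAN port gets its own IP address, the address plan
can be drawn up ahead of time and reused if the cabling changes later.
A minimal sketch using Python's ipaddress module follows.  The subnet
192.168.100.0/24 and the host names are hypothetical placeholders, and
the port counts are the 2 + 2 + 4 + 4 from the dedicated switch layout.

```python
import ipaddress


def san_address_plan(subnet: str, ports_per_host: dict) -> dict:
    """Assign one address from the subnet to every physical SAN port."""
    addresses = iter(ipaddress.ip_network(subnet).hosts())
    plan = {}
    for host, n_ports in ports_per_host.items():
        # Take the next n_ports unused host addresses for this box.
        plan[host] = [str(next(addresses)) for _ in range(n_ports)]
    return plan


# Hypothetical names and subnet; port counts as described in the thread.
ports = {
    "xen-fs-primary": 2,
    "xen-fs-backup": 2,
    "iscsi-primary": 4,
    "iscsi-backup": 4,
}

plan = san_address_plan("192.168.100.0/24", ports)
for host, addrs in plan.items():
    print(f"{host}: {', '.join(addrs)}")
```

Each address would then be bound to one NIC port, with the
multipath layer treating every port as a separate path.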

Again, apologies for the false start with the 802.3ad confusion on my
part.  I think you'll find all (or at least most) of the ducks in a row
in the recommendations above.

-- 
Stan
