From: Adam Goryachev <mailinglists@websitemanagers.com.au>
To: Mikael Abrahamsson <swmike@swm.pp.se>
Cc: Stan Hoeppner <stan@hardwarefreak.com>,
Dave Cundiff <syshackmin@gmail.com>,
linux-raid@vger.kernel.org
Subject: Re: RAID performance
Date: Mon, 11 Feb 2013 08:57:44 +1100
Message-ID: <511817D8.7020809@websitemanagers.com.au>
In-Reply-To: <alpine.DEB.2.00.1302101812060.32644@uplift.swm.pp.se>
On 11/02/13 04:19, Mikael Abrahamsson wrote:
> On Mon, 11 Feb 2013, Adam Goryachev wrote:
>
>> Nope, I'm saying that on 5 different (specifically machines 1, 4, 5,
>> 6, 7) physical boxes, (the xen host) if I do a dd
>> if=/dev/disk/by-path/iscsivm1 of=/dev/null on 5 machines concurrently,
>> then they only get 20Mbps each. If I do one at a time, I get 130Mbps,
>> if I do two at a time, I get 60Mbps, etc... If I do the same test on
>> machines 1, 2, 3, 8 at the same time, each gets 130Mbps
>
> When you say Mbps, I read that as Megabit/s. Are you in fact referring
> to megabyte/s?
Oops, my mistake. Yes, I meant MB/s for these results, because that is
the unit dd reports.
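As a rough sanity check on those figures, the sketch below converts the
quoted dd results to Mbit/s and compares the aggregate with one nominal
1Gbps link. It assumes dd's "MB" means decimal megabytes and ignores
protocol overhead; the function names are illustrative only.

```python
# Convert the dd figures (MB/s, assumed decimal megabytes) to Mbit/s and
# compare the aggregate with one nominal 1 Gbit/s link.
LINK_MBIT = 1000  # nominal rate of one gigabit link, before protocol overhead


def mb_per_s_to_mbit(mb_per_s):
    """Decimal megabytes per second -> megabits per second."""
    return mb_per_s * 8


def aggregate_mbit(readers, each_mb_per_s):
    """Total traffic when `readers` machines each see `each_mb_per_s`."""
    return readers * mb_per_s_to_mbit(each_mb_per_s)


# (number of concurrent readers, MB/s seen by each reader), from the test above
results = [(1, 130), (2, 60), (5, 20)]

for readers, each in results:
    total = aggregate_mbit(readers, each)
    print(f"{readers} reader(s) x {each} MB/s = {total} Mbit/s "
          f"({total / LINK_MBIT:.2f} x one link)")
```

In every case the aggregate stays at roughly one link's worth of
traffic, which is what you would expect if all of the concurrent streams
end up on the same member of the LAG.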
> I suspect the load balancing (hashing) function on the switch terminating
> the LAG is causing your problem. Typically this hashing function doesn't
> look at load on individual links, but a specific src/dst/port hash
> points to a certain link, and there isn't really anything you can do
> about it. The only way around it is to go 10GE instead of the LAG, or
> move away from the LAG and assign 4 different IPs, one per physical
> link, and then make sure routing to/from server/client always goes onto
> the same link, cutting worst-case down to two servers sharing one link
> (8 servers, 4 links).
Given the flat topology, I think it is difficult (though not impossible)
to ensure that both inbound and outbound traffic is sent and received on
the correct interface. Since the routes to all 8 destinations are on the
same network, Linux would (AFAIK) choose the lowest-numbered interface
for all outbound traffic. Getting the right outbound interface is the
first issue. Once that is solved, the second issue is ensuring that each
interface only sends ARP replies for its own IP. Both of these are
solvable...
However, this adds lots of complexity, and this system is supposed to
allow heartbeat to automatically move the 'floating' IP to the secondary
server on failure, which certainly adds some complications there also.
It'd be nice to avoid all that, but if that is what is needed, then I'll
have to address all that.
>> The problem is that (from my understanding) LACP will balance the
>> traffic based on the destination MAC address, by default. So the
>> bandwidth between any two machines is limited to a single 1Gbps link.
>> So regardless of the number of ethernet ports on the DC box, it will
>> only ever use a max of 1Gbps to talk to the iSCSI server.
>
> LACP is a way to set up a bunch of ports in a channel. It doesn't affect
> how traffic will be shared, that is a property of the hardware/software
> mix in the switch/operating system (LACP is control plane, it's not
> forwarding plane). The device egressing the packet onto a link decides
> what port it goes out of, typically based on L2, L3 and L4 properties
> (different for different devices).
>
>> However, if I configure Linux to use xmit_hash_policy=1 it will use
>> the IP address and port (layer 3+4) to decide which trunk to use. It
>> will still only use 1Gbps to talk to that IP:port combination.
>
> As expected. You do not want to send packets belonging to a single
> "session" out different ports, because then you might get packet
> reordering. This is called "per-packet load sharing", if it's desirable
> then it might be possible to enable in the equipment. TCP doesn't like
> it though, don't know how storage protocols react.
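To illustrate why a single IP:port combination stays on one link, here
is a minimal sketch of the layer3+4 formula as I understand it from the
bonding documentation of that era. Newer kernels may compute the hash
differently, and a switch almost certainly does. The IP addresses and
source ports are made up for the example; 3260 is the standard iSCSI
port.

```python
import ipaddress


def layer34_slave(src_ip, src_port, dst_ip, dst_port, slave_count):
    """Slave index using the formula from the older bonding documentation:
    ((sport XOR dport) XOR ((src IP XOR dst IP) AND 0xffff)) mod count."""
    ip_xor = int(ipaddress.IPv4Address(src_ip)) ^ int(ipaddress.IPv4Address(dst_ip))
    return ((src_port ^ dst_port) ^ (ip_xor & 0xFFFF)) % slave_count


SAN = "192.168.1.10"     # hypothetical iSCSI target address
CLIENT = "192.168.1.21"  # hypothetical xen host address
ISCSI_PORT = 3260        # standard iSCSI target port

# One TCP connection: the 4-tuple never changes, so neither does the slave.
one_flow = {layer34_slave(SAN, ISCSI_PORT, CLIENT, 40000, 4) for _ in range(1000)}
print("slaves used by one connection:", sorted(one_flow))

# Several connections from the same client differ only in source port.
many_flows = {layer34_slave(SAN, ISCSI_PORT, CLIENT, port, 4)
              for port in range(40000, 40008)}
print("slaves used by eight connections:", sorted(many_flows))
```

Under this model a single iSCSI connection can never exceed one link,
while several connections between the same two hosts can spread across
the bond, but only in the direction where this hash is the one being
applied.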
Hmmm, so from my reading, it seems that the SAN will never receive
out-of-order packets, since each sender only has 1Gbps and the switch
will only deliver the data over one port anyway.
However, the clients (8 physical machines) would certainly receive
out-of-order packets, since the SAN is sending 4 x 1Gbps of data and the
switch cannot deliver it that fast to a single 1Gbps port. The switch
queues would fill up and probably cause some packet loss, which would
slow everything down.
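A toy model of the reordering concern, under assumptions that are mine
rather than measured: packets are striped round-robin across the bond,
each link adds a fixed queuing delay, and nothing is dropped. The delay
values are arbitrary. It only shows that unequal delays on the links are
enough to deliver a single stream out of order.

```python
def arrival_order(num_packets, link_delays, send_interval=1.0):
    """Sequence numbers in arrival order when packet i leaves on link
    i % len(link_delays) at time i * send_interval."""
    arrivals = []
    for seq in range(num_packets):
        link = seq % len(link_delays)
        arrivals.append((seq * send_interval + link_delays[link], seq))
    arrivals.sort()  # ties fall back to sequence number
    return [seq for _, seq in arrivals]


def count_reordered(order):
    """Packets that arrive after a higher sequence number has been seen."""
    highest = -1
    late = 0
    for seq in order:
        if seq < highest:
            late += 1
        else:
            highest = seq
    return late


equal = arrival_order(12, [5.0, 5.0, 5.0, 5.0])
skewed = arrival_order(12, [5.0, 9.5, 5.0, 5.0])  # link 1 has a deeper queue
print("equal delays: ", equal, "late:", count_reordered(equal))
print("skewed delays:", skewed, "late:", count_reordered(skewed))
```

With equal delays the stream arrives in order. With one slower link,
every packet that took that link arrives behind later ones.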
I see a kernel option, net.ipv4.tcp_reordering. Would setting this to a
higher value allow me to use RR (round-robin) for the bonded connections,
even if the server has more total bandwidth than the recipient?
If I use a 10G connection for the SAN, and multiple 1G connections for
the clients, then I will still end up with a max of 1G read speed, since
the switch will only deliver data on a single port. So to get better
than 1G speed, I must use higher bandwidth channels, but using 10G on
all machines allows a single server to "flood" the network...
I suppose a maximum of 100MB/s for any individual client could be
acceptable, and if I could ensure that each client connects over a
distinct port, I could drop 2 x 4-port ethernet cards into the SAN. I
suspect this won't work, though, because either the switch or Linux will
not balance the traffic properly. Potentially, I could manually
configure the MAC addresses on the clients and leave Linux on its
MAC-based hashing, choosing the custom MAC addresses so that each one
hashes to a unique port. That just leaves the switch sending traffic
back to the SAN, and I don't know how I would go about that... Perhaps
it uses the source MAC address to decide the destination trunk, in which
case it will either work because of the first fix above, or not work
because of it (if Linux and the switch calculate the hash differently)...
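A minimal sketch of the custom-MAC idea, assuming the simplified layer2
policy (source MAC XOR destination MAC, modulo slave count) and assuming
only the last byte of each address takes part, which is how I understand
older kernels to have implemented it. All MAC addresses here are made-up
locally administered ones, and this says nothing about how the switch
hashes traffic in the other direction.

```python
def layer2_slave(src_mac, dst_mac, slave_count):
    """Simplified layer2 policy: (src MAC XOR dst MAC) mod slave count,
    using only the last byte of each address (an assumption, see above)."""
    src_last = int(src_mac.split(":")[-1], 16)
    dst_last = int(dst_mac.split(":")[-1], 16)
    return (src_last ^ dst_last) % slave_count


SAN_MAC = "02:00:00:00:00:a0"  # hypothetical MAC of the SAN bond
SLAVES = 8                     # 2 x 4-port cards in the SAN

# Hypothetical client MACs whose last bytes differ in the low three bits.
client_macs = [f"02:00:00:00:01:{i:02x}" for i in range(8)]

ports = [layer2_slave(SAN_MAC, mac, SLAVES) for mac in client_macs]
for mac, port in zip(client_macs, ports):
    print(f"SAN -> {mac} leaves on slave {port}")
```

With 8 slaves, client addresses whose last bytes cover all eight
low-bit patterns each map to a different slave under this model. Whether
the switch picks matching ports for the return traffic is a separate
question.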
I'm still at a loss on how to correctly configure my network to solve
these issues. Any hints would be appreciated.
Regards,
Adam
--
Adam Goryachev
Website Managers
www.websitemanagers.com.au