From: Stan Hoeppner <stan@hardwarefreak.com>
To: Mdadm <linux-raid@vger.kernel.org>
Cc: Roberto Spadim <roberto@spadim.com.br>, NeilBrown <neilb@suse.de>,
Christoph Hellwig <hch@infradead.org>, Drew <drew.kay@gmail.com>
Subject: Re: high throughput storage server?
Date: Sun, 20 Mar 2011 18:22:30 -0500
Message-ID: <4D868C36.5050304@hardwarefreak.com>
In-Reply-To: <AANLkTi=2k2=YuZAggonLfKmRFxFd-rXvNo=xkpqWyQNU@mail.gmail.com>
Roberto Spadim put forth on 3/20/2011 12:32 AM:
> i think it's better contact ibm/dell/hp/compaq/texas/anyother and talk
> about the problem, post results here, this is a nice hardware question
> :)
I don't need vendor assistance to design a hardware system capable of
the 10GB/s NFS throughput target. That's relatively easy. I've already
specified one possible hardware combination capable of this level of
performance (see below). The configuration will handle 10GB/s using the
RAID function of the LSI SAS HBAs. The only question is whether it has
enough individual and aggregate CPU horsepower, memory, and HT
interconnect bandwidth to do the same using mdraid. This is the reason
for my questions directed at Neil.
> don't tell about software raid, just the hardware to allow this
> bandwidth (10gb/s) and share files
I already posted some of the minimum hardware specs earlier in this
thread for the given workload I described. Following is a description
of the workload and a complete hardware specification.
Target workload:
10GB/s continuous parallel NFS throughput serving 50+ NFS clients whose
application performs large streaming reads. At the storage array level
the 50+ parallel streaming reads become a random IO pattern workload
requiring a huge number of spindles due to the high seek rates.
Minimum hardware requirements, based on performance and cost. Ballpark
guess on total cost of the hardware below is $150-250k USD. We can't
get the data to the clients without a network, so the specification
starts with the switching hardware needed.
Ethernet switches:
   One HP A5820X-24XG-SFP+ (JC102A) 24 10 GbE SFP ports
      488 Gb/s backplane switching capacity
   Five HP A5800-24G Switch (JC100A) 24 GbE ports, 4 10GbE SFP
      208 Gb/s backplane switching capacity
   Maximum common MTU enabled (jumbo frame) globally
   Connect 12 server 10 GbE ports to A5820X
   Uplink 2 10 GbE ports from each A5800 to A5820X
      2 open 10 GbE ports left on A5820X for cluster expansion
      or off cluster data transfers to the main network
   Link aggregate 12 server 10 GbE ports to A5820X
   Link aggregate each client's 2 GbE ports to A5800s
   Aggregate client->switch bandwidth = 12.5 GB/s
   Aggregate server->switch bandwidth = 15.0 GB/s
   The excess server b/w of 2.5GB/s is a result of the following:
      Allowing headroom for an additional 10 clients or out of cluster
         data transfers
      Balancing the packet load over the 3 quad port 10 GbE server NICs
         regardless of how many clients are active to prevent hot spots
         in the server memory and interconnect subsystems
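For anyone who wants to check the network figures, the arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope check: it assumes exactly 50 clients (the low end of "50+"), counts raw link rates with no Ethernet/IP/NFS overhead, and every variable name is made up for the illustration.

```python
# Back-of-the-envelope check of the switch and link figures.
# Assumptions: exactly 50 clients, raw link rates, no protocol overhead.
BITS_PER_BYTE = 8

clients = 50
gbe_ports_per_client = 2        # bonded 1 GbE ports per client
server_10gbe_ports = 12         # 3 quad port 10 GbE NICs
edge_switches = 5               # A5800 units
gbe_ports_per_edge_switch = 24
uplinks_per_edge_switch = 2     # 10 GbE uplinks to the A5820X
core_10gbe_ports = 24           # A5820X

# Aggregate bandwidths in GB/s.
client_gbytes = clients * gbe_ports_per_client * 1 / BITS_PER_BYTE
server_gbytes = server_10gbe_ports * 10 / BITS_PER_BYTE
uplink_gbytes = edge_switches * uplinks_per_edge_switch * 10 / BITS_PER_BYTE

# Headroom on the server side, expressed in GB/s and in extra clients.
headroom_gbytes = server_gbytes - client_gbytes
per_client_gbytes = gbe_ports_per_client * 1 / BITS_PER_BYTE
extra_clients = int(headroom_gbytes / per_client_gbytes)

# Port budget on the core and edge switches.
open_core_ports = (core_10gbe_ports - server_10gbe_ports
                   - edge_switches * uplinks_per_edge_switch)
spare_edge_ports = (edge_switches * gbe_ports_per_edge_switch
                    - clients * gbe_ports_per_client)

print(f"client->switch: {client_gbytes} GB/s")
print(f"edge->core uplinks: {uplink_gbytes} GB/s")
print(f"server->switch: {server_gbytes} GB/s")
print(f"headroom: {headroom_gbytes} GB/s, about {extra_clients} more clients")
print(f"open core ports: {open_core_ports}, spare edge ports: {spare_edge_ports}")
```

Under these assumptions the edge uplinks match the client aggregate exactly, so the uplinks, not the server ports, are the first place to look if more clients are added.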
Server chassis
   HP ProLiant DL585 G7 with the following specifications
   Dual AMD Opteron 6136, 16 cores @2.4GHz
      20GB/s node-node HT b/w, 160GB/s aggregate
   128GB DDR3 1333, 16x8GB RDIMMs in 8 channels
      20GB/s/node memory bandwidth, 80GB/s aggregate
   7 PCIe x8 slots and 4 PCIe x16
      8GB/s/slot, 56 GB/s aggregate PCIe x8 bandwidth
IO controllers
   4 x LSI SAS 9285-8e 8 port SAS, 800MHz dual core ROC, 1GB cache
   3 x NIAGARA 32714 PCIe x8 Quad Port Fiber 10 Gigabit Server Adapter
JBOD enclosures
   16 x LSI 620J 2U 24 x 2.5" bay SAS 6Gb/s, w/SAS expander
      2 SFF 8088 host and 1 expansion port per enclosure
      384 total SAS 6Gb/s 2.5" drive bays
   Two units are daisy chained with one in each pair
   connecting to one of 8 HBA SFF8088 ports, for a total of
   32 6Gb/s SAS host connections, yielding 38.4 GB/s full duplex b/w
Disk drives
   384 HITACHI Ultrastar C15K147 147GB 15000 RPM 64MB Cache 2.5" SAS
   6Gb/s Internal Enterprise Hard Drive
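The SAS fan-out above can be sanity checked with a short Python sketch. It rests on two assumptions that are not spelled out in the spec list: each SFF-8088 connector carries 4 SAS lanes, and a 6Gb/s lane moves roughly 600MB/s of payload after 8b/10b encoding. The variable names are illustrative only.

```python
# Sanity check of the JBOD/HBA fan-out.
# Assumptions: 4 lanes per SFF-8088 connector, 8b/10b encoding on each
# 6 Gb/s lane (10 line bits per payload byte), so 600 MB/s per lane.
enclosures = 16
bays_per_enclosure = 24
enclosures_per_chain = 2          # two units daisy chained per HBA port
hbas = 4
connectors_per_hba = 2            # 8 SAS ports per HBA / 4 lanes each
lanes_per_connector = 4
lane_mbytes = 6000 * 8 // 10 // 8   # 6000 Mb/s line rate -> 600 MB/s

total_bays = enclosures * bays_per_enclosure
chains = enclosures // enclosures_per_chain
hba_connectors = hbas * connectors_per_hba
host_lanes = hba_connectors * lanes_per_connector

one_way_gbytes = host_lanes * lane_mbytes / 1000
full_duplex_gbytes = 2 * one_way_gbytes
drives_per_connector = total_bays // hba_connectors

print(f"{total_bays} bays in {chains} chains on {hba_connectors} HBA connectors")
print(f"{host_lanes} host lanes, {one_way_gbytes:.1f} GB/s one way, "
      f"{full_duplex_gbytes:.1f} GB/s full duplex")
print(f"{drives_per_connector} drives behind each 4-lane connector")
```

The number of daisy chains equals the number of HBA connectors, which is why every enclosure pair gets its own 4-lane host link.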
Note that the HBA to disk bandwidths of 19.2GB/s one way and 38.4GB/s
full duplex are in excess of the HBA to PCIe bandwidths, 16 and 32GB/s
respectively, by approximately 20%. Also note that each drive can
stream reads at 160MB/s peak, yielding 61GB/s aggregate streaming read
capacity for the 384 disks. This is almost 4 times the aggregate one
way transfer rate of the 4 PCIe x8 slots, and is 6 times our target host
to parallel client data rate of 10GB/s. There are a few reasons why
this excess of capacity is built into the system:
1. RAID10 is the only suitable RAID level for this type of system with
this many disks, for many reasons that have been discussed before.
RAID10 instantly cuts the number of stripe spindles in two, dropping the
data rate by a factor of 2, giving us 30.5GB/s potential aggregate
throughput. Now we're only at 3 times our target data rate.
2. As a single disk drive's seek rate increases, its transfer rate
decreases in relation to its single streaming read performance.
Parallel streaming reads will increase seek rates as the disk head must
move between different regions of the disk platter.
3. In relation to 2, if we assume we'll lose no more than 66% of our
single streaming performance with a multi stream workload, we're down to
10.1GB/s throughput, right at our target.
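The chain of derating steps in points 1 through 3 can be written out as a small Python sketch. It assumes the figures given above (160MB/s peak per drive, 4GB/s one way per PCIe x8 slot, 33% of streaming rate retained under a multi stream load); the names are hypothetical. Without intermediate rounding the results come out slightly above the 61, 30.5 and 10.1GB/s quoted in the text.

```python
# Throughput budget, from raw spindle rate down to the multi stream estimate.
# Assumptions: 160 MB/s peak per drive, RAID10 halves the stripe spindles,
# and each drive keeps 33% of its streaming rate under parallel reads.
drives = 384
drive_stream_mbytes = 160
pcie_one_way_gbytes = 4 * 4       # 4 HBA slots at 4 GB/s one way each
target_gbytes = 10
retained_fraction = 0.33

raw_gbytes = drives * drive_stream_mbytes / 1000
raid10_gbytes = raw_gbytes / 2
multi_stream_gbytes = raid10_gbytes * retained_fraction

print(f"raw streaming:      {raw_gbytes:.2f} GB/s "
      f"({raw_gbytes / pcie_one_way_gbytes:.1f}x PCIe one way, "
      f"{raw_gbytes / target_gbytes:.1f}x target)")
print(f"after RAID10:       {raid10_gbytes:.2f} GB/s "
      f"({raid10_gbytes / target_gbytes:.1f}x target)")
print(f"multi stream reads: {multi_stream_gbytes:.2f} GB/s "
      f"({multi_stream_gbytes / target_gbytes:.2f}x target)")
```

The margin at the last step is thin, so the 33% figure is the assumption that deserves the most scrutiny when testing real drives.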
By using relatively small arrays of 24 drives each (12 stripe spindles),
concatenating (--linear) the 16 resulting arrays, and using a filesystem
such as XFS across the entire array with its intelligent load balancing
of streams using allocation groups, we minimize disk head seeking.
Doing this can in essence divide our 50 client streams across 16 arrays,
with each array seeing approximately 3 of the streaming client reads.
Each disk should be able to easily maintain 33% of its max read rate
while servicing 3 streaming reads.
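To put numbers on the per array load, here is a minimal Python sketch. It assumes an idealized even spread of exactly 50 streams over the 16 arrays, which is a simplification of what XFS allocation group placement would actually do, and the variable names are invented for the example.

```python
# Idealized spread of client streams over the concatenated RAID10 arrays.
# Assumption: streams are distributed as evenly as possible; real XFS
# allocation group placement will not be this regular.
streams = 50
arrays = 16
drives_per_array = 24
stripe_spindles = drives_per_array // 2   # RAID10: half the drives stripe
target_mbytes = 10_000                    # 10 GB/s target
drive_stream_mbytes = 160

# Even distribution: the first `extra` arrays take one more stream.
base, extra = divmod(streams, arrays)
streams_per_array = [base + 1 if i < extra else base for i in range(arrays)]

# Rate each array and each stripe spindle must sustain to hit the target.
per_array_mbytes = target_mbytes / arrays
per_spindle_mbytes = per_array_mbytes / stripe_spindles
fraction_of_peak = per_spindle_mbytes / drive_stream_mbytes

print(f"streams per array: min {min(streams_per_array)}, "
      f"max {max(streams_per_array)}")
print(f"per array: {per_array_mbytes:.0f} MB/s, "
      f"per stripe spindle: {per_spindle_mbytes:.1f} MB/s "
      f"({fraction_of_peak:.1%} of peak)")
```

With this spread no array sees more than 4 streams, and each stripe spindle needs to hold just under a third of its peak streaming rate, which is consistent with the 33% assumption above.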
I hope you found this informative or interesting. I enjoyed the
exercise. I'd been working on this system specification for quite a few
days now but have been hesitant to post it due to its length, and the
fact that AFAIK hardware discussion is a bit OT on this list.
I hope it may be valuable to someone Google'ing for this type of
information in the future.
--
Stan