From: Joe Landman <joe.landman@gmail.com>
To: Matt Garman <matthew.garman@gmail.com>
Cc: Doug Dumitru <doug@easyco.com>, Mdadm <linux-raid@vger.kernel.org>
Subject: Re: high throughput storage server?
Date: Tue, 15 Feb 2011 10:16:15 -0500
Message-ID: <4D5A98BF.3030704@gmail.com>
In-Reply-To: <20110215044434.GA9186@septictank.raw-sewage.fake>

[disclosure: vendor posting, ignore if you wish, vendor html link at 
bottom of message]

On 02/14/2011 11:44 PM, Matt Garman wrote:
> On Mon, Feb 14, 2011 at 06:06:43PM -0800, Doug Dumitru wrote:
>> You have a whole slew of questions to answer before you can decide
>> on a design.  This is true if you build it yourself or decide to
>> go with a vendor and buy a supported server.  If you do go with a
>> vendor, the odds are actually quite good you will end up with
>> Linux anyway.
>
> I kind of assumed/wondered if the vendor-supplied systems didn't run
> Linux behind the scenes anyway.

We've been using Linux as the basis for our storage systems. 
Occasionally customers require other OSes, but for the most part, 
Linux is the preferred platform.

[...]

>> Next, is the space all the same.  Perhaps some of it is "active"
>> and some of it is archival.  If you need 4TB of "fast" storage and
>> ...
>> well.  You can probably build this for around $5K (or maybe a bit
>> less) including a 10GigE adapter and server class components.
>
> The whole system needs to be "fast".

Ok ... sounds strange, but ...

Define what you mean by "fast".  Seriously ... we've had people tell us 
about their "huge" storage needs that we can easily fit onto a single 
small unit, no storage cluster needed.  We've had people say "fast" when 
they mean "able to keep 1 GbE port busy".

"Fast" really needs to be articulated in terms of what you will do with 
it.  As you noted in this and other messages, you are scaling up from 10 
compute nodes to 40 compute nodes.  That is a 4x change in demand, and I 
am guessing the limiting factor will be bandwidth (if these are large 
files you are streaming) or IOPs (if these are many small files you are 
reading).  Small and large here would mean less than 64kB for small, and 
greater than 4MB for large.
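
If it helps to put a number on that, a crude first pass (just a sketch; 
the file path below is made up) is to time a streaming read of one of 
your large data files from a compute node and compare that against what 
a single client actually needs during a run:

  # streaming-read check from one compute node; drop the local page
  # cache first so you measure the server and the wire, not RAM
  echo 3 > /proc/sys/vm/drop_caches
  dd if=/mnt/nas/some_large_file of=/dev/null bs=4M

dd prints a MB/s figure at the end; multiply by the number of clients 
reading concurrently to get a feel for the aggregate you need.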


> Actually, to give more detail, we currently have a simple system I
> built for backup/slow access.  This is exactly what you described, a
> bunch of big, slow disks.  Lots of space, lousy I/O performance, but
> plenty adequate for backup purposes.

Your choice is simple.  Build or buy.  Many folks have made suggestions, 
and some are pretty reasonable, though a pure SSD or Flash based 
machine, while doable (and we sell these), is quite unlikely to be close 
to the realities of your budget.  There are use cases for which this 
does make sense, but the costs are quite prohibitive for all but a few 
users.

> As of right now, we actually have about a dozen "users", i.e.
> compute servers.  The collection is basically a home-grown compute
> farm.  Each server has a gigabit ethernet connection, and 1 TB of
> RAID-1 spinning disk storage.  Each server mounts every other
> server via NFS, and the current data is distributed evenly across
> all systems.

Ok ... this isn't something that's great to manage.  I might suggest 
looking at GlusterFS for this.  You can aggregate and distribute your 
data, and even build in some resiliency if you wish/need.  GlusterFS 
3.1.2 is open source, so you can deploy it fairly easily.
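
A minimal sketch of what that looks like (host names and brick paths 
below are made up; check the 3.1 docs for your exact version):

  # on one node: aggregate a brick from each box into one namespace
  # (assumes the peers have already been probed into the trusted pool
  #  with "gluster peer probe")
  gluster volume create scratch node1:/data/brick node2:/data/brick \
      node3:/data/brick node4:/data/brick
  # add "replica 2" after the volume name if you want mirrored copies
  gluster volume start scratch

  # on each client: mount the aggregate via the native client
  mount -t glusterfs node1:/scratch /mnt/scratch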

>
> So, loosely speaking, right now we have roughly 10 TB of
> "live"/"fast" data available at 1 to 10 gbps, depending on how you
> look at it.
>
> While we only have about a dozen servers now, we have definitely
> identified growing this compute farm about 4x (to 40--50 servers)
> within the next year.  But the storage capacity requirements
> shouldn't change too terribly much.  The 20 TB number was basically
> thrown out there as a "it would be nice to have 2x the live
> storage".

Without building a storage unit, you could (in concept) use GlusterFS 
for this.  In practice, this model gets harder and harder to manage as 
you increase the number of nodes.  Adding the (N+1)th node means you 
have N+1 nodes to modify and manage storage on.  This does not scale 
well at all.
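
For what it's worth, the mechanics of growing such a volume are only a 
couple of commands (again a sketch, same made-up names as above), but 
every expansion still means touching the trusted pool and waiting on a 
rebalance, which is exactly the management overhead I mean:

  # fold the new node's brick into the existing volume ...
  gluster volume add-brick scratch node5:/data/brick
  # ... then spread existing files onto it
  gluster volume rebalance scratch start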

>
> I'll also add that this NAS needs to be optimized for *read*
> throughput.  As I mentioned, the only real write process is the
> daily "harvesting" of the data files.  Those are copied across
> long-haul leased lines, and the copy process isn't really
> performance sensitive.  In other words, in day-to-day use, those
> 40--50 client machines will do 100% reading from the NAS.

Ok.

This isn't a commercial.  I'll keep this part short.

We've built systems like this that sustain north of 10 GB/s (big B, not 
little b) for concurrent read and write access from thousands of cores. 
20TB (and 40TB) are on the ... small ... side for this, but it is very 
doable.

As a tie-in to the Linux RAID list, we use md RAID for our OS drives 
(SSD pairs) and other utility functions within the unit, as well as for 
striping over our hardware-accelerated RAIDs.  We would like to use 
non-power-of-two chunk sizes, but haven't delved into the code as much 
as we'd like, to see whether we can make that work.
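
To make that concrete, the pattern looks roughly like this (device names 
are hypothetical):

  # mirrored SSD pair for the OS
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2

  # RAID-0 stripe over two hardware RAID LUNs, 512 KiB (power-of-two) chunk
  mdadm --create /dev/md1 --level=0 --chunk=512 --raid-devices=2 \
      /dev/sdc /dev/sdd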

As a rule, we find mdadm to be an excellent tool, and the whole md RAID 
system to be quite good.  We may spend time at some point figuring out 
what's wrong with the multi-threaded raid456 bit (it allocated 200+ 
kernel threads last I played with it), but apart from bits like that, we 
find it very good for production use.  It isn't as fast as some 
dedicated accelerated RAID hardware, though we have our md + kernel 
stack tuned well enough that some of our software RAIDs are faster than 
many of our competitors' hardware RAIDs.

You could build a fairly competent unit using md RAID.
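
As a sketch of what "fairly competent" might look like with plain md 
(drive letters and counts are illustrative only, not a recommendation):

  # 8-drive RAID-6 plus a hot spare, 256 KiB chunk
  mdadm --create /dev/md2 --level=6 --chunk=256 --raid-devices=8 \
      --spare-devices=1 /dev/sd[c-k]
  # for streaming reads, a bigger stripe cache and readahead usually help
  echo 8192 > /sys/block/md2/md/stripe_cache_size
  blockdev --setra 65536 /dev/md2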

It all gets back to build versus buy.  In either case, I'd recommend 
grabbing a copy of dstat (http://dag.wieers.com/home-made/dstat/) and 
watching your IO/network throughput.  I am assuming 1 GbE switches as 
the basis for your cluster, and that this will not change.  The cost of 
your time/effort, and any opportunity cost and productivity loss, should 
also be accounted for in the cost-benefit analysis.  That is, if it 
costs you less overall to buy than to build, should you build anyway?  
Generally no, but some people simply want the experience.
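
Something as simple as the following, run on the storage head and on a 
client or two during a production workload, will tell you quickly 
whether you are wire-limited, disk-limited, or CPU-limited:

  # CPU, disk, and network throughput, sampled every 5 seconds
  dstat -c -d -n 5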

The big issue you need to be aware of with md RAID is the hot-swap 
problem.  Your SATA link needs to allow you to pull a drive out without 
crashing the machine.  Many of the on-motherboard SATA connections we've 
used over the years don't tolerate unplugs/replugs very well.  I'd 
recommend at least a reasonable HBA that understands hot swap and 
handles it correctly (you need hardware and driver level support to 
correctly signal the kernel of these events).
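
Once the HBA and driver do handle it, the md side of a drive replacement 
is routine (a sketch; device names hypothetical):

  # mark the dying member failed and remove it from the array
  mdadm --manage /dev/md2 --fail /dev/sdh --remove /dev/sdh
  # physically swap the drive, then add the replacement back in
  mdadm --manage /dev/md2 --add /dev/sdh
  # watch the rebuild progress
  cat /proc/mdstat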

If you decide to buy, have a really clear idea of your performance 
regime, and a realistic eye towards budget.  A 48 TB server with > 2 GB/s 
streaming performance for TB-sized files is very doable, well under $30k 
USD.  A 48 TB software RAID version would be quite a bit less than that.

Good luck with this, and let us know what you do.

vendor html link:  http://scalableinformatics.com , our storage clusters 
http://scalableinformatics.com/sicluster
