From: David Brown <david@westcontrol.com>
To: linux-raid@vger.kernel.org
Subject: Re: high throughput storage server?
Date: Tue, 15 Feb 2011 14:39:14 +0100
Message-ID: <ijdvmn$4ff$1@dough.gmane.org>
In-Reply-To: <4D5A7198.7060607@hardwarefreak.com>

On 15/02/2011 13:29, Stan Hoeppner wrote:
> Matt Garman put forth on 2/14/2011 5:59 PM:
>
>> The requirement is basically this: around 40 to 50 compute machines
>> act as basically an ad-hoc scientific compute/simulation/analysis
>> cluster.  These machines all need access to a shared 20 TB pool of
>> storage.  Each compute machine has a gigabit network connection, and
>> it's possible that nearly every machine could simultaneously try to
>> access a large (100 to 1000 MB) file in the storage pool.  In other
>> words, a 20 TB file store with bandwidth upwards of 50 Gbps.
>
> If your description of the requirement is accurate, then what you need is a
> _reliable_ high performance NFS server backed by many large/fast spindles.
>
>> I was wondering if anyone on the list has built something similar to
>> this using off-the-shelf hardware (and Linux of course)?
>
> My thoughtful, considered, recommendation would be to stay away from a DIY build
> for the requirement you describe, and stay away from mdraid as well, but not
> because mdraid isn't up to the task.  I get the feeling you don't fully grasp
> some of the consequences of a less than expert level mdraid admin being
> responsible for such a system after it's in production.  If multiple drives are
> kicked off line simultaneously (posts of such seem to occur multiple times/week
> here), downing the array, are you capable of bringing it back online intact,
> successfully, without outside assistance, in a short period of time?  If you
> lose the entire array due to a typo'd mdadm parm, then what?
>

This brings up an important point - no matter what sort of system you
get (home-made, mdadm RAID, or whatever), you will want to run some
tests and drills at replacing failed drives.  Also make sure everything
is well documented and well labelled.  When mdadm sends you an email
telling you that drive sdx has failed, you want to be /very/ sure you
know which physical drive is sdx before you pull it!
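
A quick sanity check before pulling anything - something like this
(assuming smartmontools is installed; device names are just examples)
lets you match the kernel name against the serial number printed on
the drive's label:

  # Print the drive's serial number to check against the tray label
  smartctl -i /dev/sdx | grep -i serial

  # The persistent names under /dev/disk/by-id/ also carry the serial
  ls -l /dev/disk/by-id/ | grep sdx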



You also want to consider your RAID setup carefully.  RAID 10 has been
mentioned here several times - it is often a good choice, but not
always.  RAID 10 gives you fast recovery, and at best it can survive
the loss of half your disks - but at worst the loss of two disks will
bring down the whole set.  It is also very inefficient in space.  If
you use SSDs, it may not be worth double the price to have RAID 10; if
you use hard disks, it may not give you enough safety.

I haven't built a RAID of anything like this size, so my comments here
are only based on my imperfect understanding of the theory - I'm
learning too.

RAID 10 has the advantage of good read speed (close to RAID 0), at the
cost of poorer write speed and poor space efficiency.  RAID 5 and
RAID 6 are space efficient and fast for most purposes, but slow for
rebuilds and for small writes.

You are not much bothered about write performance, and most of your 
writes are large anyway.

How about building the array as a two-tier RAID 6+5 setup?  Take
7 x 1 TB disks as a RAID 6, giving 5 TB of space per set.  Five such
sets combined as a RAID 5 give you your 20 TB across 35 drives.  This
will survive any four failed disks, and often more, depending on the
combination.  If you are careful about how it is arranged, it will
also survive a failed controller card.
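
As a rough sketch (untested, and the device names and md numbers are
purely illustrative), the mdadm commands could look something like:

  # Five inner RAID 6 sets, 7 x 1 TB drives each -> 5 TB usable per set
  mdadm --create /dev/md1 --level=6 --raid-devices=7 /dev/sd[b-h]
  mdadm --create /dev/md2 --level=6 --raid-devices=7 /dev/sd[i-o]
  # ... and likewise for /dev/md3, /dev/md4 and /dev/md5

  # Outer RAID 5 across the five sets -> 4 x 5 TB = 20 TB usable,
  # with a write-intent bitmap so a removed member can catch up quickly
  mdadm --create /dev/md10 --level=5 --raid-devices=5 --bitmap=internal \
        /dev/md1 /dev/md2 /dev/md3 /dev/md4 /dev/md5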

If a disk fails, you could remove that whole inner set from the outer
array (which should have a write-intent bitmap) - then the rebuild of
the inner set will run at maximal speed, while the outer array's speed
will not be so badly affected.  Once the rebuild is complete, put the
set back into the outer array.  Since you are not doing many writes,
it will not take long to catch up.
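
In mdadm terms that could look roughly like this (again only a sketch
with made-up device names, and relying on the outer array's bitmap):

  # A drive in the first inner set has died; take that whole set out of
  # the outer array so its rebuild is not slowed by outer-array I/O
  mdadm /dev/md10 --fail /dev/md1
  mdadm /dev/md10 --remove /dev/md1

  # Replace the dead drive and let the inner RAID 6 rebuild at full speed
  mdadm /dev/md1 --add /dev/sdz
  # ... watch /proc/mdstat until the rebuild finishes ...

  # Put the set back; thanks to the write-intent bitmap only the chunks
  # written in the meantime need to be resynced
  mdadm /dev/md10 --re-add /dev/md1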

It is probably worth having a small array of SSDs (RAID 1 or RAID 10)
to hold the write-intent bitmap, the journal for your main file system,
and of course your OS.  Maybe one of those absurdly fast PCI Express
flash disks would be a good choice.
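
For example (a sketch only, with invented device names - and note that
an external bitmap file is expected to live on an ext2/ext3 file system
that is not on the array it serves):

  # Two small mirrors on a pair of partitioned SSDs: one for the OS and
  # the bitmap file, one as an external log device for the file system
  mdadm --create /dev/md20 --level=1 --raid-devices=2 /dev/sdaa1 /dev/sdab1
  mdadm --create /dev/md21 --level=1 --raid-devices=2 /dev/sdaa2 /dev/sdab2
  mkfs.ext3 /dev/md20
  mount /dev/md20 /ssd

  # Move the outer array's bitmap from internal to a file on the SSDs
  mdadm --grow /dev/md10 --bitmap=none
  mdadm --grow /dev/md10 --bitmap=/ssd/md10.bitmap

  # e.g. XFS on the big array with its log on the second SSD mirror
  mkfs.xfs -l logdev=/dev/md21 /dev/md10
  mount -o logdev=/dev/md21 /dev/md10 /data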



