From: Joe Landman <joe.landman@gmail.com>
To: Mattias Wadenstein <maswan@acc.umu.se>
Cc: Matt Garman <matthew.garman@gmail.com>,
	Mdadm <linux-raid@vger.kernel.org>
Subject: Re: high throughput storage server?
Date: Fri, 18 Feb 2011 19:24:58 -0500
Message-ID: <4D5F0DDA.1000100@gmail.com>
In-Reply-To: <Pine.GSO.4.64.1102181429201.7398@montezuma.acc.umu.se>

On 02/18/2011 08:49 AM, Mattias Wadenstein wrote:
> On Mon, 14 Feb 2011, Matt Garman wrote:

[...]

>> I was wondering if anyone on the list has built something similar to
>> this using off-the-shelf hardware (and Linux of course)?
>
> Well, this seems fairly close to the LHC data analysis case, or HPC
> usage in general, both of which I'm rather familiar with.

It's similar to many HPC workloads dealing with large data sets.  There's 
nothing unusual about this in the HPC world.

>
>> My initial thoughts/questions are:
>>
>> (1) We need lots of spindles (i.e. many small disks rather than
>> few big disks). How do you compute disk throughput when there are
>> multiple consumers? Most manufacturers provide specs on their drives
>> such as sustained linear read throughput. But how is that number
>> affected when there are multiple processes simultaneously trying to
>> access different data? Is the sustained bulk read throughput value
>> inversely proportional to the number of consumers? (E.g. 100 MB/s
>> drive only does 33 MB/s w/three consumers.) Or is there a more
>> specific way to estimate this?
>
> This is tricky. In general there isn't a good way of estimating this,
> because so much about this involves the way your load interacts with
> IO-scheduling in both Linux and (if you use them) raid controllers, etc.
>
> The actual IO pattern of your workload is probably the biggest factor
> here, determining both whether readahead will give any benefit and how
> much sequential IO can be done as opposed to just seeking.

Absolutely.

Good real-time data can be had from a number of tools: collectl, 
iostat, sar, and so on.  I personally like atop for the "dashboard"-like 
view.  Collectl and the others can capture even more data for you to 
analyze later.
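For instance, a quick way to watch per-spindle behavior while the 
analysis jobs run (assuming the sysstat and collectl packages are 
installed; the interval values below are arbitrary):

   # extended per-device stats in MB/s, refreshed every 2 seconds
   iostat -xm 2

   # per-disk detail from collectl, with timestamps, for later analysis
   collectl -sD -oT

   # interactive "dashboard" view of cpu, disk and network
   atop 2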

>
>> (2) The big storage server(s) need to connect to the network via
>> multiple bonded Gigabit ethernet, or something faster like
>> FibreChannel or 10 GbE. That seems pretty straightforward.
>
> I'd also look at the option of many small & cheap servers, especially
> if the load is spread out fairly evenly over the filesets.

Here is where things like GlusterFS and FhGFS shine.  Once Ceph firms 
up, you will be able to use that as well.  Happily, all of these run 
atop an MD RAID device (to tie into the list).
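As a rough sketch of what that can look like with GlusterFS (the 
hostnames, brick paths and volume name below are made up; each brick 
would sit on a filesystem on top of an MD array):

   # hostnames and brick paths are placeholders
   gluster peer probe stor2
   gluster peer probe stor3

   # one brick per server, files distributed across them
   gluster volume create data transport tcp \
       stor1:/raid/brick stor2:/raid/brick stor3:/raid/brick
   gluster volume start data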

>> (3) This will probably require multiple servers connected together
>> somehow and presented to the compute machines as one big data store.
>> This is where I really don't know much of anything. I did a quick
>> "back of the envelope" spec for a system with 24 600 GB 15k SAS drives
>> (based on the observation that 24-bay rackmount enclosures seem to be
>> fairly common). Such a system would only provide 7.2 TB of storage
>> using a scheme like RAID-10. So how could two or three of these
>> servers be "chained" together and look like a single large data pool
>> to the analysis machines?
>
> Here you would either maintain a large list of NFS mounts for the read
> load, or start looking at a distributed filesystem. Sticking them all
> into one big fileserver is easier on the administration side, but
> quickly gets really expensive once you want to put multiple 10GE
> interfaces on it.
>
> If the load is almost all read and seldom updated, and you can afford
> the time to manually lay out data files over the servers, the NFS
> mounts option might work well for you. If the analysis cluster also
> creates files here and there, you might need a parallel filesystem.

One of the nicer aspects of GlusterFS in this context is that it 
provides an NFS-compatible server that ordinary NFS clients can connect 
to.  Some things aren't supported in the current release, but I 
anticipate they will be soon.

Moreover, in distribute mode it will do a reasonable job of spreading 
the files among the nodes.  It is sort of like the NFS layout model, 
but with a "random" distribution, which should be reasonably good on 
average.
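To give a sense of it (using the same made-up names as in the sketch 
above), a client can use either the native client or plain NFSv3 
against the built-in server:

   # native GlusterFS client (FUSE based)
   mount -t glusterfs stor1:/data /mnt/data

   # or plain NFSv3 against Gluster's built-in NFS server
   mount -t nfs -o vers=3,proto=tcp stor1:/data /mnt/data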

>
> 2U machines with 12 3.5" or 16-24 2.5" hdd slots can be gotten pretty
> cheaply. Add a quad-gige card if your load can do decent sequential
> IO, or look at fast/SSD 2.5" drives if it is mostly short random
> reads. Then add as many machines as you need to sustain the analysis
> speed you need. The advantage here is that this is really scalable:
> if you double the number of servers, you get at least twice the IO
> capacity.
>
> Oh, yet another setup I've seen is adding some (2-4) fast disks to
> each of the analysis machines and then running a distributed,
> replicated filesystem like Hadoop over them.

Ugh ... short-stroking drives or using SSDs?  Quite cost-inefficient 
for this work.  And given the HPC nature of the problem, it's probably 
a good idea to aim for something more cost-efficient.

That said, I'd recommend at least looking at GlusterFS.  Put it atop an 
MD RAID (6 or 10), and you should be in pretty good shape with the right 
network design.  That is, as long as you don't use a bad SATA/SAS HBA.
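A minimal sketch of that underlying MD piece, assuming a 24-bay chassis 
with the data drives at /dev/sd[b-y] (device names, chunk size and 
level are placeholders to adjust for your hardware):

   # RAID-10 across all 24 drives (device names are placeholders) ...
   mdadm --create /dev/md0 --level=10 --raid-devices=24 \
       --chunk=256 /dev/sd[b-y]

   # ... or RAID-6 instead, if capacity matters more than small writes
   # mdadm --create /dev/md0 --level=6 --raid-devices=24 \
   #     --chunk=256 /dev/sd[b-y]

   # then put a filesystem on it and use it as a Gluster brick
   mkfs.xfs /dev/md0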

Joe
-- 
Joe Landman
landman@scalableinformatics.com

