From: Roberto Spadim <roberto@spadim.com.br>
To: NeilBrown <neilb@suse.de>
Cc: Stan Hoeppner <stan@hardwarefreak.com>,
	Christoph Hellwig <hch@infradead.org>, Drew <drew.kay@gmail.com>,
	Mdadm <linux-raid@vger.kernel.org>
Subject: Re: high throughput storage server?
Date: Sun, 20 Mar 2011 02:32:38 -0300
Message-ID: <AANLkTi=2k2=YuZAggonLfKmRFxFd-rXvNo=xkpqWyQNU@mail.gmail.com>
In-Reply-To: <20110320144147.29141f04@notabene.brown>

With 2 disks in md RAID0 I get 400MB/s (SAS 10krpm, 6Gb/s channel), so
you will need at least 10000/400 = 25 pairs = 50 disks just as a
starting number.
What about memory/CPU/network speed?
Memory must sustain more than 10GB/s (DDR3 can do this; I don't know
whether enabling ECC will be a problem or not - check with memtest86+).
CPU? Hmm, I don't really know how to help here, since it's mostly just
reading and writing between memory and the interfaces (network/disks).
Maybe a 'magic' number like 3GHz * 64 bits / 8 = 24GB/s - I don't know
how to estimate it properly, but I think you will need a multicore CPU:
maybe one core for the network, one for the disks, one for mdadm, one
for NFS and one for Linux itself, so at least 5 cores at 3GHz, 64-bit
each (maybe starting with a 6-core Xeon with hyper-threading). A rough
sketch of this arithmetic is below.
It's just an idea of how to estimate; it's not correct/true/real.
I think it's better to contact IBM/Dell/HP/Compaq/Texas/anyone else and
talk about the problem, then post the results here - this is a nice
hardware question :)
Don't tell them about software RAID, just ask for the hardware needed
to deliver this bandwidth (10GB/s) and share files.
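
Just to make that arithmetic concrete, a rough sketch in Python,
purely illustrative and using only the numbers above:

# Rough sizing sketch, illustrative only, using the figures above.
target_mb_s = 10000      # 10GB/s aggregate target
pair_mb_s = 400          # measured: 2-disk md RAID0, SAS 10krpm, 6Gb/s link

pairs = target_mb_s // pair_mb_s     # 25 two-disk RAID0 pairs
disks = pairs * 2                    # 50 disks as a bare minimum
print(pairs, "pairs ->", disks, "disks minimum")

# Memory bandwidth floor: the data crosses RAM at least once between the
# disks and the network, so sustained memory bandwidth has to sit
# comfortably above the 10GB/s target as well.
mem_floor_gb_s = target_mb_s / 1000
print("memory bandwidth floor: >", mem_floor_gb_s, "GB/s")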

2011/3/20 NeilBrown <neilb@suse.de>:
> On Sat, 19 Mar 2011 20:34:26 -0500 Stan Hoeppner <stan@hardwarefreak.com>
> wrote:
>
>> NeilBrown put forth on 3/18/2011 5:01 PM:
>> > On Fri, 18 Mar 2011 10:43:43 -0500 Stan Hoeppner <stan@hardwarefreak.com>
>> > wrote:
>> >
>> >> Christoph Hellwig put forth on 3/18/2011 9:05 AM:
>> >>
>> >> Thanks for the confirmations and explanations.
>> >>
>> >>> The kernel is pretty smart in placement of user and page cache data, but
>> >>> it can't really second guess your intention.  With the numactl tool you
>> >>> can help it do the proper placement for your workload.  Note that the
>> >>> choice isn't always trivial - a numa system tends to have memory on
>> >>> multiple nodes, so you'll either have to find a good partitioning of
>> >>> your workload or live with off-node references.  I don't think
>> >>> partitioning NFS workloads is trivial, but then again I'm not a
>> >>> networking expert.
>> >>
>> >> Bringing mdraid back into the fold, I'm wondering what kind of load the
>> >> mdraid threads would place on a system of the caliber needed to push
>> >> 10GB/s NFS.
>> >>
>> >> Neil, I spent quite a bit of time yesterday spec'ing out what I believe
>> >
>> > Addressing me directly in an email that wasn't addressed to me directly seems
>> > a bit ... odd.  Maybe that is just me.
>>
>> I guess that depends on one's perspective.  Is it the content of email
>> To: and Cc: headers that matters, or the substance of the list
>> discussion thread?  You are the lead developer and maintainer of Linux
>> mdraid AFAIK.  Thus I would have assumed that directly addressing a
>> question to you within any given list thread was acceptable, regardless
>> of whose address was where in the email headers.
>
> This assumes that I read every email on this list.  I certainly do read a lot,
> but I tend to tune out of threads that don't seem particularly interesting -
> and configuring hardware is only vaguely interesting to me - and I am sure
> there are people on the list with more experience.
>
> But whatever... there is certainly more chance of me missing something that
> isn't directly addressed to me (such messages get filed differently).
>
>
>>
>> >> How much of each core's cycles will we consume with normal random read
>> >
>> > For RAID10, the md thread plays no part in reads.  Whichever thread
>> > submitted the read submits it all the way down to the relevant member device.
>> > If the read fails the thread will come in to play.
>>
>> So with RAID10, read scalability is in essence limited to the execution
>> rate of the block device layer code and the interconnect b/w required.
>
> Correct.
>
>>
>> > For writes, the thread is used primarily to make sure the writes are properly
>> > ordered w.r.t. bitmap updates.  I could probably remove that requirement if a
>> > bitmap was not in use...
>>
>> How compute intensive is this thread during writes, if at all, at
>> extreme IO bandwidth rates?
>
> Not compute intensive at all - just single threaded.  So it will only
> dispatch a single request at a time.  Whether single threading the writes is
> good or bad is not something that I'm completely clear on.  It seems bad in
> the sense that modern machines have lots of CPUs and we are forgoing any
> possible benefits of parallelism.  However the current VM seems to do all
> (or most) writeout from a single thread per device - the 'bdi' threads.
> So maybe keeping it single threaded in the md level is perfectly natural and
> avoids cache bouncing...
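
To picture that ordering, a toy model in Python (nothing like the real
md code; the queue and the 1024-sector bitmap region are made up): a
single dispatcher thread records the bitmap region as dirty before it
issues each write, which is the ordering the single md thread provides.

import queue, threading

requests = queue.Queue()   # writes handed down by arbitrary caller threads
dirty_bits = set()         # toy stand-in for the write-intent bitmap

def submit_to_member(bio):
    # in real md this would go to the member block device
    print("write sector", bio["sector"], "on", bio["device"])

def md_write_thread():
    # single dispatcher: mark the bitmap region dirty before issuing the
    # write, so the data write never lands ahead of its bitmap update
    while True:
        bio = requests.get()
        if bio is None:
            break
        region = bio["sector"] // 1024      # made-up bitmap granularity
        dirty_bits.add(region)              # in md, synced to disk first
        submit_to_member(bio)

t = threading.Thread(target=md_write_thread)
t.start()
requests.put({"sector": 4096, "device": "sdb"})
requests.put({"sector": 8192, "device": "sdc"})
requests.put(None)                          # shut the toy thread down
t.join()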
>
>
>>
>> >> operations assuming 10GB/s of continuous aggregate throughput?  Would
>> >> the mdraid threads consume sufficient cycles that when combined with
>> >> network stack processing and interrupt processing, that 16 cores at
>> >> 2.4GHz would be insufficient?  If so, would bumping the two sockets up
>> >> to 24 cores at 2.1GHz be enough for the total workload?  Or, would we
>> >> need to move to a 4 socket system with 32 or 48 cores?
>> >>
>> >> Is this possibly a situation where mdraid just isn't suitable due to the
>> >> CPU, memory, and interconnect bandwidth demands, making hardware RAID
>> >> the only real option?
>> >
>> > I'm sorry, but I don't do resource usage estimates or comparisons with
>> > hardware raid.  I just do software design and coding.
>>
>> I probably worded this question very poorly and have possibly made
>> unfair assumptions about mdraid performance.
>>
>> >>     And if it does require hardware RAID, would it
>> >> be possible to stick 16 block devices together in a --linear mdraid
>> >> array and maintain the 10GB/s performance?  Or, would the single
>> >> --linear array be processed by a single thread?  If so, would a single
>> >> 2.4GHz core be able to handle an mdraid --linear thread managing 8
>> >> devices at 10GB/s aggregate?
>> >
>> > There is no thread for linear or RAID0.
>>
>> What kernel code is responsible for the concatenation and striping
>> operations of mdraid linear and RAID0 if not an mdraid thread?
>>
>
> When the VM or filesystem or whatever wants to start an IO request, it calls
> into the md code to find out how big it is allowed to make that request.  The
> md code returns a number which ensures that the request will end up being
> mapped onto just one drive (at least in the majority of cases).
> The VM or filesystem builds up the request (a struct bio) to at most that
> size and hands it to md.  md simply assigns a different target device and
> offset in that device to the request, and hands it over to the target device.
>
> So whatever thread it was that started the request carries it all the way
> down to the device which is a member of the RAID array (for RAID0/linear).
> Typically it then gets placed on a queue, and an interrupt handler takes it
> off the queue and acts upon it.
>
> So - no separate md thread.
>
> RAID1 and RAID10 make only limited use of their thread, doing as much of the
> work as possible in the original calling thread.
> RAID4/5/6 do lots of work in the md thread.  The calling thread just finds a
> place in the stripe cache to attach the request, attaches it, and signals the
> thread.
> (Though reads on a non-degraded array can by-pass the cache and are handled
> much like reads on RAID0).
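
A toy illustration of that remapping step (Python, not the actual md
code; the chunk size and device names are invented): the calling thread
computes which member device and offset a sector lands on and submits
straight to that device, so no md thread is involved for RAID0/linear.

def raid0_map(sector, chunk_sectors, members):
    # Toy RAID0 mapping: chunks are striped round-robin across members.
    chunk = sector // chunk_sectors           # which chunk of the array
    member = chunk % len(members)             # round-robin across members
    stripe = chunk // len(members)            # how far down that member
    offset = stripe * chunk_sectors + sector % chunk_sectors
    return members[member], offset

def linear_map(sector, member_sizes, members):
    # Toy linear mapping: walk the concatenated members in order.
    for dev, size in zip(members, member_sizes):
        if sector < size:
            return dev, sector
        sector -= size
    raise ValueError("sector beyond end of array")

# The caller's own thread does this lookup and submits to the member
# device it names; only failures (or RAID5/6 work) need the md thread.
print(raid0_map(5000, chunk_sectors=1024, members=["sda", "sdb", "sdc", "sdd"]))
print(linear_map(5000, member_sizes=[2048, 8192], members=["md0", "md1"]))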
>
>> > If you want to share load over a number of devices, you would normally use
>> > RAID0.  However if the load had a high thread count and the filesystem
>> > distributed IO evenly across the whole device space, then linear might work
>> > for you.
>>
>> In my scenario I'm thinking I'd want to stay away from RAID0 because of the
>> multi-level stripe width issues of double nested RAID (RAID0 over
>> RAID10).  I assumed linear would be the way to go, as my scenario calls
>> for using XFS.  Using 32 allocation groups should evenly spread the
>> load, which is ~50 NFS clients.
>
> You may well be right.
>
>>
>> What I'm trying to figure out is how much CPU time I am going to need for:
>>
>> 1.  Aggregate 10GB/s IO rate
>> 2.  mdraid managing 384 drives
>>     A.  16 mdraid10 arrays of 24 drives each
>>     B.  mdraid linear concatenating the 16 arrays
>
> I very much doubt that CPU is going to be an issue.  Memory bandwidth might -
> but I'm only really guessing here, so it is probably time to stop.
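
For a sense of scale, spreading the 10GB/s target over that layout
works out to fairly modest per-device rates (rough read-side arithmetic
only, assuming RAID10 reads are spread across all spindles):

target_mb_s = 10000                      # aggregate target from the thread
arrays = 16                              # RAID10 arrays under the linear concat
drives_per_array = 24
total_drives = arrays * drives_per_array          # 384 drives
per_array_mb_s = target_mb_s / arrays             # 625 MB/s per RAID10 array
per_drive_mb_s = target_mb_s / total_drives       # ~26 MB/s per drive for reads
print(total_drives, "drives:", per_array_mb_s, "MB/s per array,",
      round(per_drive_mb_s, 1), "MB/s per drive")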
>
>
>>
>> Thanks for your input Neil.
>>
> Pleasure.
>
> NeilBrown
>



-- 
Roberto Spadim
Spadim Technology / SPAEmpresarial
