From: Stan Hoeppner <stan@hardwarefreak.com>
To: David Brown <david.brown@hesbynett.no>
Cc: vincent Ferrer <vincentchicago1@gmail.com>, linux-raid@vger.kernel.org
Subject: Re: raid5 to utilize upto 8 cores
Date: Fri, 17 Aug 2012 02:15:55 -0500
Message-ID: <502DEFAB.3060206@hardwarefreak.com>
In-Reply-To: <502CA6CE.1080105@hesbynett.no>

On 8/16/2012 2:52 AM, David Brown wrote:
> On 16/08/2012 07:58, Stan Hoeppner wrote:
>> On 8/15/2012 9:56 PM, vincent Ferrer wrote:
>>
>>> - My storage server has up to 8 cores running Linux kernel 2.6.32.27.
>>> - I created a raid5 device of 10 SSDs.
>>> - It seems I only have a single raid5 kernel thread, limiting my
>>> WRITE throughput to a single CPU core/thread.
>>
>> The single write threads of md/RAID5/6/10 are being addressed by patches
>> in development.  Read the list archives for progress/status.  There were
>> 3 posts to the list today regarding the RAID5 patch.
>>
>>> Question:  What are my options to make my raid5 thread use all the
>>> CPU cores?  My SSDs can do much more, but the single raid5 thread
>>> from mdadm is becoming the bottleneck.
>>>
>>> To overcome the above single-thread raid5 limitation (for now) I
>>> re-configured:
>>>       1)  I partitioned each of my 10 SSDs into 8 partitions.
>>>       2)  I created 8 raid5 arrays, each raid5 thread having one
>>> partition from each of the SSDs.
>>>       3)  My WRITE performance quadrupled because I now have 8 RAID5
>>> threads.
>>> Question: Is this workaround normal practice, or may it give me
>>> maintenance problems later on?
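
(For reference, the workaround described above boils down to something
like the following; the device names are illustrative only, assuming the
10 SSDs appear as /dev/sda through /dev/sdj, each carved into partitions
1 through 8.)

  ~$ for i in $(seq 1 8); do \
       mdadm --create /dev/md$i --level=5 --raid-devices=10 /dev/sd[a-j]$i; \
     done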
>>
>> No it is not normal practice.  I 'preach' against it regularly when I
>> see OPs doing it.  It's quite insane.  The glaring maintenance problem
>> is that when one SSD fails, and at least one will, you'll have 8 arrays
>> to rebuild vs one.  This may be acceptable to you, but not to the
>> general population.  With rust drives, and real workloads, it tends to
>> hammer the drive heads prodigiously, increasing latency and killing
>> performance, and decreasing drive life.  That's not an issue with SSDs,
>> but multiple rebuilds are.  That, and simply keeping track of 80
>> partitions.
>>
> 
> The rebuilds will, I believe, be done sequentially rather than in
> parallel.  And each rebuild will take 1/8 of the time a full array
> rebuild would have done.  So it really should not be much more time or
> wear-and-tear for a rebuild of this monster setup, compared to a single
> raid5 array rebuild.  (With hard disks, it would be worse due to head
> seeks - but still not as bad as you imply, if I am right about the
> rebuilds being done sequentially.)
> 
> However, there was a recent thread here about someone with a similar
> setup (on hard disks) who had a failure during such a rebuild and had
> lots of trouble.  That makes me sceptical of this sort of multiple-array
> setup (in addition to Stan's other points).
> 
> And of course, all Stan's other points about maintenance, updates to
> later kernels with multiple raid5 threads, etc., still stand.
> 
>> There are a couple of sane things you can do today to address your
>> problem:
>>
>> 1.  Create a RAID50, a layered md/RAID0 over two 5-SSD md/RAID5 arrays.
>>   This will double your threads and your IOPS.  It won't be as fast as
>> your Frankenstein setup and you'll lose one SSD of capacity to
>> additional parity.  However, it's sane, stable, doubles your
>> performance, and you have only one array to rebuild after an SSD
>> failure.  Any filesystem will work well with it, including XFS if
>> aligned properly.  It gives you an easy upgrade path: as soon as the
>> threaded patches hit, a simple kernel upgrade will give your two RAID5
>> arrays the extra threads, so you're simply out one SSD of capacity.  You
>> won't need to, and probably won't want to rebuild the entire thing after
>> the patch.  With the Frankenstein setup you'll be destroying and
>> rebuilding arrays.  And if these are consumer grade SSDs, you're much
>> better off having two drives worth of redundancy anyway, so a RAID50
>> makes good sense all around.
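
(A rough sketch of the RAID50 above, assuming the SSDs are /dev/sda
through /dev/sdj and a 64KiB chunk; both are illustrative, so adjust to
your hardware:

  ~$ mdadm --create /dev/md1 --level=5 --raid-devices=5 --chunk=64 /dev/sd[a-e]
  ~$ mdadm --create /dev/md2 --level=5 --raid-devices=5 --chunk=64 /dev/sd[f-j]
  ~$ mdadm --create /dev/md0 --level=0 --raid-devices=2 --chunk=64 /dev/md1 /dev/md2

The md/RAID0 layered over the two RAID5 arrays is what doubles the write
threads.)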
>>
>> 2.  Make 5 md/RAID1 mirrors and concatenate them with md/RAID linear.
>> You'll get one md write thread per RAID1 device utilizing 5 cores in
>> parallel.  The linear driver doesn't use threads, but passes offsets to
>> the block layer, allowing infinite core scaling.  Format the linear
>> device with XFS and mount with inode64.  XFS has been fully threaded for
>> 15 years.  Its allocation group design along with the inode64 allocator
>> allows near linear parallel scaling across a concatenated device[1],
>> assuming your workload/directory layout is designed for parallel file
>> throughput.
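
(Roughly, as an alternative to the RAID50 sketch above, again with
illustrative device names and an assumed mount point of /srv/data:

  ~$ mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sd[ab]
  ~$ mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sd[cd]
  ~$ mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sd[ef]
  ~$ mdadm --create /dev/md4 --level=1 --raid-devices=2 /dev/sd[gh]
  ~$ mdadm --create /dev/md5 --level=1 --raid-devices=2 /dev/sd[ij]
  ~$ mdadm --create /dev/md0 --level=linear --raid-devices=5 /dev/md[1-5]
  ~$ mkfs.xfs /dev/md0
  ~$ mount -o inode64 /dev/md0 /srv/data

Each mirror gets its own md write thread; the linear layer and XFS then
spread independent files across the five pairs.)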
>>
>> #2, with a parallel write workload, may be competitive with your
>> Frankenstein setup in both IOPS and throughput, even with 3 fewer RAID
>> threads and 4 fewer SSD "spindles".  It will outrun the RAID50 setup
>> like it's standing still.  You'll lose half your capacity to redundancy
>> as with RAID10, but you'll have 5 write threads for md/RAID1, one per
>> SSD pair.  One core should be plenty to drive a single SSD mirror, with
>> plenty of cycles to spare for actual applications, while sparing 3 cores
>> for apps as well.  You'll get unlimited core scaling with both md/linear
>> and XFS.  This setup will yield the best balance of IOPS and throughput
>> performance for the amount of cycles burned on IO, compared to
>> Frankenstein and the RAID50.
> 
> For those who don't want to use XFS, or won't have balanced directories
> in their filesystem, or want greater throughput of larger files (rather
> than greater average throughput of multiple parallel accesses), you can
> also take your 5 raid1 mirror pairs and combine them with raid0.  You
> should get similar scaling (the CPU does not limit raid0).  For some
> applications (such as mail server, /home mount, etc.), the XFS over a
> linear concatenation is probably unbeatable.  But for others (such as
> serving large media files), a raid0 over raid1 pairs could well be
> better.  As always, it depends on your load - and you need to test with
> realistic loads or at least realistic simulations.
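
(Built on the same five mirror pairs as in the sketch above, the striped
variant would look something like this; the 64KiB chunk is again just an
assumption:

  ~$ mdadm --create /dev/md0 --level=0 --raid-devices=5 --chunk=64 /dev/md[1-5]
  ~$ mkfs.xfs -d su=64k,sw=5 /dev/md0

Here the stripe unit/width hints simply match the raid0 geometry, five
data members at 64KiB each.)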

Sure, a homemade RAID10 would work as it avoids the md/RAID10 single
write thread.  I intentionally avoided mentioning this option for a few
reasons:

1.  Anyone needing 10 SATA SSDs obviously has a parallel workload
2.  With a concat, any thread still has up to 200-500MB/s (one SSD)
    available.  I can't see a single thread needing 4.5GB/s of bandwidth;
    if one did, md/RAID couldn't deliver it anyway, not on COTS hardware
3.  With a parallel workload requiring this many SSDs, XFS is a must
4.  With a concat, mkfs.xfs is simple, with no stripe alignment needed:
    ~$ mkfs.xfs /dev/md0
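
    For contrast, any striped layout wants alignment hints at mkfs time;
    e.g. a single 5-SSD RAID5 with an assumed 64KiB chunk would be made
    with something like:
    ~$ mkfs.xfs -d su=64k,sw=4 /dev/md0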

-- 
Stan

