From: Stan Hoeppner <stan@hardwarefreak.com>
To: David Brown <david.brown@hesbynett.no>
Cc: vincent Ferrer <vincentchicago1@gmail.com>, linux-raid@vger.kernel.org
Subject: Re: raid5 to utilize upto 8 cores
Date: Fri, 17 Aug 2012 02:15:55 -0500
Message-ID: <502DEFAB.3060206@hardwarefreak.com>
In-Reply-To: <502CA6CE.1080105@hesbynett.no>

On 8/16/2012 2:52 AM, David Brown wrote:
> On 16/08/2012 07:58, Stan Hoeppner wrote:
>> On 8/15/2012 9:56 PM, vincent Ferrer wrote:
>>
>>> - My storage server has up to 8 cores running Linux kernel 2.6.32.27.
>>> - I created a raid5 device of 10 SSDs.
>>> - It seems I only have a single raid5 kernel thread, limiting my
>>> WRITE throughput to a single CPU core/thread.
>>
>> The single write threads of md/RAID5/6/10 are being addressed by patches
>> in development. Read the list archives for progress/status. There were
>> 3 posts to the list today regarding the RAID5 patch.
>>
>>> Question: What are my options to make my raid5 thread use all the
>>> CPU cores? My SSDs can do much more, but the single raid5 thread
>>> from mdadm is becoming the bottleneck.
>>>
>>> To overcome the above single-thread raid5 limitation (for now), I
>>> re-configured:
>>> 1) I partitioned each of my 10 SSDs into 8 partitions.
>>> 2) I created 8 raid5 arrays, giving me 8 raid5 threads, each array
>>> having a partition from each of the 10 SSDs.
>>> 3) My WRITE performance quadrupled because I now have 8 RAID5
>>> threads.
>>> Question: Is this workaround normal practice, or may it give me
>>> maintenance problems later on?
>>
>> No, it is not normal practice. I 'preach' against it regularly when I
>> see OPs doing it. It's quite insane. The glaring maintenance problem
>> is that when one SSD fails, and at least one will, you'll have 8 arrays
>> to rebuild vs. one. This may be acceptable to you, but not to the
>> general population. With rust drives and real workloads, it tends to
>> hammer the drive heads prodigiously, increasing latency, killing
>> performance, and decreasing drive life. That's not an issue with SSDs,
>> but multiple rebuilds are. That, and simply keeping track of 80
>> partitions.
>>
>
> The rebuilds will, I believe, be done sequentially rather than in
> parallel. And each rebuild will take 1/8 of the time a full array
> rebuild would have taken. So it really should not mean much more time or
> wear-and-tear for a rebuild of this monster setup, compared to a single
> raid5 array rebuild. (With hard disks, it would be worse due to head
> seeks - but still not as bad as you imply, if I am right about the
> rebuilds being done sequentially.)
>
> However, there was a recent thread here about someone with a similar
> setup (on hard disks) who had a failure during such a rebuild and had
> lots of trouble. That makes me sceptical of this sort of multiple-array
> setup (in addition to Stan's other points).
>
> And of course, all Stan's other points about maintenance, updates to
> later kernels with multiple raid5 threads, etc., still stand.
>
>> There are a couple of sane things you can do today to address your
>> problem:
>>
>> 1. Create a RAID50, a layered md/RAID0 over two 5-SSD md/RAID5 arrays.
>> This will double your threads and your IOPS. It won't be as fast as
>> your Frankenstein setup and you'll lose one SSD of capacity to
>> additional parity. However, it's sane, stable, doubles your
>> performance, and you have only one array to rebuild after an SSD
>> failure. Any filesystem will work well with it, including XFS if
>> aligned properly. It gives you an easy upgrade path: as soon as the
>> threaded patches hit, a simple kernel upgrade will give your two RAID5
>> arrays the extra threads, so you're simply out one SSD of capacity. You
>> won't need to, and probably won't want to, rebuild the entire thing
>> after the patch. With the Frankenstein setup you'll be destroying and
>> rebuilding arrays. And if these are consumer grade SSDs, you're much
>> better off having two drives worth of redundancy anyway, so a RAID50
>> makes good sense all around.
>>
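
For reference, a minimal sketch of building such a RAID50 (untested;
/dev/sd[a-j] and the mdadm defaults are assumptions, substitute your
own device names and chunk size):

~$ mdadm --create /dev/md1 --level=5 --raid-devices=5 /dev/sd[a-e]
~$ mdadm --create /dev/md2 --level=5 --raid-devices=5 /dev/sd[f-j]
~$ mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/md1 /dev/md2

You format and mount md0, and after an SSD failure only the RAID5 leg
that lost the drive has to rebuild.
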
>> 2. Make 5 md/RAID1 mirrors and concatenate them with md/RAID linear.
>> You'll get one md write thread per RAID1 device utilizing 5 cores in
>> parallel. The linear driver doesn't use threads, but passes offsets to
>> the block layer, allowing infinite core scaling. Format the linear
>> device with XFS and mount with inode64. XFS has been fully threaded for
>> 15 years. Its allocation group design along with the inode64 allocator
>> allows near linear parallel scaling across a concatenated device[1],
>> assuming your workload/directory layout is designed for parallel file
>> throughput.
>>
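
And a sketch of what #2 looks like end to end (again untested; the
device names and mount point are placeholders):

~$ mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda /dev/sdb
~$ mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdc /dev/sdd
~$ mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sde /dev/sdf
~$ mdadm --create /dev/md4 --level=1 --raid-devices=2 /dev/sdg /dev/sdh
~$ mdadm --create /dev/md5 --level=1 --raid-devices=2 /dev/sdi /dev/sdj
~$ mdadm --create /dev/md0 --level=linear --raid-devices=5 /dev/md[1-5]
~$ mkfs.xfs /dev/md0
~$ mount -o inode64 /dev/md0 /your/mountpoint

Each RAID1 pair gets its own md write thread, and md/linear plus the
XFS allocation groups spread parallel jobs across the pairs.
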
>> #2, with a parallel write workload, may be competitive with your
>> Frankenstein setup in both IOPS and throughput, even with 3 fewer RAID
>> threads and 4 fewer SSD "spindles". It will outrun the RAID50 setup
>> like it's standing still. You'll lose half your capacity to redundancy
>> as with RAID10, but you'll have 5 write threads for md/RAID1, one per
>> SSD pair. One core should be plenty to drive a single SSD mirror, with
>> cycles to spare for actual applications, leaving the other 3 cores free
>> for apps entirely. You'll get unlimited core scaling with both md/linear
>> and XFS. This setup will yield the best balance of IOPS and throughput
>> performance for the amount of cycles burned on IO, compared to
>> Frankenstein and the RAID50.
>
> For those who don't want to use XFS, or won't have balanced directories
> in their filesystem, or want greater throughput for large files (rather
> than greater average throughput across multiple parallel accesses), you
> can also take your 5 raid1 mirror pairs and combine them with raid0. You
> should get similar scaling (the CPU does not limit raid0). For some
> applications (such as a mail server, /home mount, etc.), XFS over a
> linear concatenation is probably unbeatable. But for others (such as
> serving large media files), a raid0 over raid1 pairs could well be
> better. As always, it depends on your load - and you need to test with
> realistic loads, or at least realistic simulations.

Sure, a homemade RAID10 would work, as it avoids the md/RAID10 single
write thread. I intentionally avoided mentioning this option for a few
reasons:
1. Anyone needing 10 SATA SSDs obviously has a parallel workload.
2. Any thread will have up to 200-500MB/s available (one SSD) with a
   concat. I can't see a single thread needing 4.5GB/s of B/W; if one
   did, md/RAID isn't capable of it, not on COTS hardware.
3. With a parallel workload requiring this many SSDs, XFS is a must.
4. With a concat, mkfs.xfs is simple, no stripe aligning, etc. (see the
   contrast below):

   ~$ mkfs.xfs /dev/md0
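
For contrast, the RAID50 in option 1 would want explicit alignment at
mkfs time, something along these lines (an example only; the su/sw
values assume the mdadm default 512KiB chunk and 8 data spindles,
i.e. 2 arrays x 4 data disks):

~$ mkfs.xfs -d su=512k,sw=8 /dev/md0

Adjust su to your actual chunk size and sw to your actual data spindle
count. With the concat there's nothing to align.
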
--
Stan