From: Thomas Fjellstrom <thomas@fjellstrom.ca>
To: stan@hardwarefreak.com
Cc: Chris Murphy <lists@colorremedies.com>,
linux-raid Raid <linux-raid@vger.kernel.org>
Subject: Re: recommended way to add ssd cache to mdraid array
Date: Fri, 11 Jan 2013 05:35:12 -0700
Message-ID: <201301110535.12512.thomas@fjellstrom.ca>
In-Reply-To: <50EF5A4B.7000502@hardwarefreak.com>
On Thu Jan 10, 2013, Stan Hoeppner wrote:
> On 1/10/2013 3:36 PM, Chris Murphy wrote:
> > On Jan 10, 2013, at 3:49 AM, Thomas Fjellstrom <thomas@fjellstrom.ca> wrote:
> >> A lot of it will be streaming. Some may end up being random read/writes.
> >> The test is just to gauge overall performance of the setup. 600MB/s
> >> read is far more than I need, but having writes at 1/3 that seems odd
> >> to me.
> >
> > Tell us how many disks there are, and what the chunk size is. It could be
> > too small if you have too few disks which results in a small full stripe
> > size for a video context. If you're using the default, it could be too
> > big and you're getting a lot of RMW. Stan, and others, can better answer
> > this.
>
> Thomas is using a benchmark, and a single one at that, to judge the
> performance. He's not using his actual workloads. Tuning/tweaking to
> increase the numbers in a benchmark could be detrimental to actual
> performance instead of providing a boost. One must be careful.
>
> Regarding RAID6, it will always have horrible performance compared to
> non-parity RAID levels and even RAID5, for anything but full stripe
> aligned writes, which means writing new large files or doing large
> appends to existing files.
Considering it's a rather simple use case, mostly streaming video and misc
file sharing for my home network, an iozone test should be rather telling,
especially the full test, from 4KB up to 16MB records:

                                                      random   random     bkwd   record   stride
      KB reclen    write  rewrite     read   reread     read    write     read  rewrite     read   fwrite frewrite    fread  freread
33554432      4   243295   221756   628767   624081     1028     4627    16822  7468777    17740   233295   231092   582036   579131
33554432      8   241134   225728   628264   627015     2027     8879    25977 10030302    19578   228923   233928   591478   584892
33554432     16   233758   228122   633406   618248     3952    13635    35676 10166457    19968   227599   229698   579267   576850
33554432     32   232390   219484   625968   625627     7604    18800    44252 10728450    24976   216880   222545   556513   555371
33554432     64   222936   206166   631659   627823    14112    22837    52259 11243595    30251   196243   192755   498602   494354
33554432    128   214740   182619   628604   626407    25088    26719    64912 11232068    39867   198638   185078   463505   467853
33554432    256   202543   185964   626614   624367    44363    34763    73939 10148251    62349   176724   191899   593517   595646
33554432    512   208081   188584   632188   629547    72617    39145    84876  9660408    89877   182736   172912   610681   608870
33554432   1024   196429   166125   630785   632413   116793    51904   133342  8687679   121956   168756   175225   620587   616722
33554432   2048   185399   167484   622180   627606   188571    70789   218009  5357136   370189   171019   166128   637830   637120
33554432   4096   198340   188695   632693   628225   289971    95211   278098  4836433   611529   161664   170469   665617   655268
33554432   8192   177919   167524   632030   629077   371602   115228   384030  4934570   618061   161562   176033   708542   709788
33554432  16384   196639   183744   631478   627518   485622   133467   462861  4890426   644615   175411   179795   725966   734364

> However, everything is relative. This RAID6 may have plenty of random
> and streaming write/read throughput for Thomas. But a single benchmark
> isn't going to inform him accurately.
200MB/s may be enough, but the difference between the read and write
throughput is a bit unexpected. It's not a weak machine (Core i3-2120, dual
core 3.2GHz with HT, 16GB ECC 1333MHz RAM), and this is basically all it's
going to be doing.
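To put a number on the read/write gap, here is a rough sketch that computes the write-to-read ratio for a few rows of the iozone table above. The figures are copied from the sequential write and read columns (iozone reports KB/s); the function name and the choice of rows are only for illustration.

```python
# Sequential write and read throughput in KB/s, copied from the iozone
# table above and keyed by record length in KB.
results = {
    4: (243295, 628767),
    1024: (196429, 630785),
    16384: (196639, 631478),
}


def write_read_ratio(write_kbs, read_kbs):
    """Return write throughput as a fraction of read throughput."""
    return write_kbs / read_kbs


for reclen, (write_kbs, read_kbs) in results.items():
    ratio = write_read_ratio(write_kbs, read_kbs)
    print(
        f"reclen {reclen:>5} KB: "
        f"write {write_kbs / 1024:.0f} MB/s, "
        f"read {read_kbs / 1024:.0f} MB/s, "
        f"write/read = {ratio:.2f}"
    )
```

For these rows the ratio lands between roughly 0.3 and 0.4, which matches the "writes at about 1/3 of reads" impression.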
> > You said these are unpartitioned disks, I think. In which case alignment
> > of 4096 byte sectors isn't a factor if these are AF disks.
> >
> > Unlikely to make up the difference is the scheduler. Parallel fs's like
> > XFS don't perform nearly as well with CFQ, so you should have a kernel
> > parameter elevator=noop.
>
> If the HBAs have [BB|FB]WC then one should probably use noop as the
> cache schedules the actual IO to the drives. If the HBAs lack cache,
> then deadline often provides better performance. Testing of each is
> required on a system and workload basis. With two identical systems
> (hardware/RAID/OS) one may perform better with noop, the other with
> deadline. The determining factor is the applications' IO patterns.
Mostly streaming reads, some long rsyncs to copy stuff back and forth, and
file share duties (downloads, etc.).
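Before comparing noop and deadline, it helps to know which scheduler each member disk is currently using. The kernel exposes this in /sys/block/<device>/queue/scheduler, with the active scheduler shown in square brackets. A minimal sketch of reading it follows; the helper names are made up for this example, and any device name passed in (such as "sda") is an assumption about the system.

```python
from pathlib import Path


def parse_scheduler(text):
    """Split a sysfs scheduler line into (active, available).

    The kernel lists every available scheduler on one line and wraps
    the active one in square brackets, e.g. "noop deadline [cfq]".
    """
    active = None
    available = []
    for name in text.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]
            active = name
        available.append(name)
    return active, available


def read_scheduler(device):
    """Read the scheduler line for one block device, e.g. "sda"."""
    path = Path("/sys/block") / device / "queue" / "scheduler"
    return parse_scheduler(path.read_text())


# Demonstration on a sample line rather than a real device.
print(parse_scheduler("noop deadline [cfq]"))
```

Calling read_scheduler() for each member disk of the array shows whether they all use the same scheduler before any benchmark run.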
> > Another thing to look at is md/stripe_cache_size which probably needs to
> > be higher for your application.
> >
> > Another thing to look at is if you're using XFS, what your mount options
> > are. Invariably with an array of this size you need to be mounting with
> > the inode64 option.
>
> The desired allocator behavior is independent of array size but, once
> again, dependent on the workloads. inode64 is only needed for large
> filesystems with lots of files, where 1TB may not be enough for the
> directory inodes. Or, for mixed metadata/data heavy workloads.
>
> For many workloads including databases, video ingestion, etc, the
> inode32 allocator is preferred, regardless of array size. This is the
> linux-raid list so I'll not go into detail of the XFS allocators.
If you have the time and the desire, I'd like to hear about it off list.
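On the md/stripe_cache_size suggestion quoted above: raising it costs RAM, and the kernel md documentation gives the cost as page size * number of disks * stripe_cache_size. A small sketch of that arithmetic follows, assuming 4096-byte pages and a 7-disk array (the 7 is a guess based on the drive bays in this NAS, not a confirmed array layout).

```python
def stripe_cache_bytes(stripe_cache_size, nr_disks, page_size=4096):
    """Memory used by the md stripe cache, in bytes.

    Follows the formula from the kernel md documentation:
    page_size * nr_disks * stripe_cache_size.
    """
    return page_size * nr_disks * stripe_cache_size


# 256 is the kernel default; the larger values are just candidates to test.
for entries in (256, 1024, 4096, 8192):
    mib = stripe_cache_bytes(entries, nr_disks=7) / 2**20
    print(f"stripe_cache_size={entries:>5}: {mib:.0f} MiB")
```

With 16GB of RAM even the larger settings are cheap, so the memory cost is unlikely to be the limiting factor when experimenting with this value.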
> >> The reason I've selected RAID6 to begin with is I've read (on this
> >> mailing list, and on some hardware tech sites) that even with SAS
> >> drives, the rebuild/resync time on a large array using large disks
> >> (2TB+) is long enough that it gives more than enough time for another
> >> disk to hit a random read error,
> >
> > This is true for high density consumer SATA drives. It's not nearly as
> > applicable for low to moderate density nearline SATA which has an order
> > of magnitude lower UER, or for enterprise SAS (and some enterprise SATA)
> > which has yet another order of magnitude lower UER. So it depends on
> > the disks, and the RAID size, and the backup/restore strategy.
>
> Yes, enterprise drives have a much larger spare sector pool.
>
> WRT rebuild time, this is one more reason to use RAID10 or a concat of
> RAID1s. The rebuild time is low, constant, predictable. For 2TB drives
> about 5-6 hours at 100% rebuild rate. And rebuild time, for any array
> type, with gargantuan drives, is yet one more reason not to use the
> largest drives you can get your hands on. Using 1TB drives will cut
> that to 2.5-3 hours, and using 500GB drives will cut it down to 1.25-1.5
> hours, as all these drives tend to have similar streaming write rates.
>
> To wit, as a general rule I always build my arrays with the smallest
> drives I can get away with for the workload at hand. Yes, for a given
> TB total it increases acquisition cost of drives, HBAs, enclosures, and
> cables, and power consumption, but it also increases spindle count--thus
> performance-- while decreasing rebuild times substantially/dramatically.
I'd go RAID10 or something if I had the space, but this little 10TB NAS (which
is the goal: a small, quiet, not too slow 10TB NAS with some kind of
redundancy) only fits seven 3.5" HDDs.
Maybe sometime in the future I'll get a big 3U or 4U case with a crap load of
3.5" HDD bays, but for now this is what I have. I also still have my old array
(7x1TB RAID5+XFS in 4-in-3 hot swap bays with room for 8 drives), but I haven't
bothered to expand it, and the new one is almost ready to go.
I don't know if it impacts anything at all, but when burning in these drives
after I bought them, I ran the same full iozone test a couple of times, and each
drive shows 150MB/s read and somewhat lower write speeds (100-120+MB/s?). It
impressed me somewhat to see a mechanical hard drive go that fast. I remember
thinking a few years ago that 80MB/s was fast for a HDD.
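Those per-drive rates tie in with the mirror rebuild times quoted above, since rebuilding a mirror means writing one whole drive sequentially. A back-of-the-envelope sketch follows, assuming a sustained 100MB/s write rate (the low end of what these drives showed) and decimal terabytes as drives are sold; real rebuilds slow down toward the inner tracks and under competing load.

```python
def rebuild_hours(capacity_tb, rate_mb_s):
    """Hours needed to write a whole drive at a constant rate.

    capacity_tb is in decimal terabytes (1 TB = 1,000,000 MB),
    rate_mb_s is the sustained write rate in MB/s.
    """
    capacity_mb = capacity_tb * 1_000_000
    return capacity_mb / rate_mb_s / 3600


for capacity_tb in (0.5, 1, 2):
    hours = rebuild_hours(capacity_tb, rate_mb_s=100)
    print(f"{capacity_tb} TB at 100 MB/s: {hours:.1f} hours")
```

At that rate a 2TB drive takes about five and a half hours, in line with the 5-6 hour figure, and the time scales linearly with capacity.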
--
Thomas Fjellstrom
thomas@fjellstrom.ca