From: Bill Davidsen <davidsen@tmr.com>
To: Mark Hahn <hahn@physics.mcmaster.ca>
Cc: linux-raid@vger.kernel.org
Subject: Re: Odd (slow) RAID performance
Date: Tue, 19 Dec 2006 23:05:41 -0500
Message-ID: <4588B695.4000203@tmr.com>
In-Reply-To: <Pine.LNX.4.64.0612131227560.26047@coffee.psychology.mcmaster.ca>
Mark Hahn wrote:
>>>> which is right at the edge of what I need. I want to read the doc on
>>>> stripe_cache_size before going huge; if that's in K, 10MB is a LOT of
>>>> cache when 256 works perfectly in RAID-0.
>
> but they are basically unrelated.  in r5/6, the stripe cache is absolutely
> critical in caching parity chunks.  in r0, it never functions this way,
> though it may help some workloads a bit (IOs which aren't naturally
> aligned to the underlying disk layout.)
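For anyone else trying to size it: as I read the docs, stripe_cache_size counts
stripe cache entries, each holding roughly one page per member device, so the
memory cost is about entries * page_size * ndisks rather than a KB value. A
rough sketch of that arithmetic (assuming 4K pages; the helper is mine, not
anything md exposes):

    PAGE_SIZE = 4096  # assumed; the cache is sized in entries, not KB

    def stripe_cache_bytes(entries, ndisks, page_size=PAGE_SIZE):
        """Approximate RAM used by the raid5/6 stripe cache."""
        return entries * ndisks * page_size

    print(stripe_cache_bytes(256, 3) / 2**20)    # default 256 entries, 3 disks: ~3 MB
    print(stripe_cache_bytes(32768, 3) / 2**20)  # a very large setting: ~384 MB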
>
>>>> Any additional input appreciated, I would expect the speed to be
>>>> (Ndisk - 1)*SingleDiskSpeed without a huge buffer, so the fact that it isn't
>
> as others have reported, you can actually approach that with "naturally"
> aligned and sized writes.
I don't know what would be natural; I have three drives with a 256K chunk
size, and was originally testing with 1MB writes. I have a hard time seeing
a case where there would be a need to read-alter-rewrite: each chunk should
be writable as data1, data2, and parity, without readback. I was writing
directly to the array, so the data should start on a chunk boundary. Until
I went very large on stripe_cache_size, performance was almost exactly the
write speed of a single drive. There is no obvious way to explain that
other than writing one drive at a time. Shrinking the write size by factors
of two decreased performance further, down to about 13% of the speed of a
single drive. Such performance just isn't useful, and going to RAID-10
eliminated the problem, indicating that the RAID-5 implementation is the
cause.
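To make the aligned case concrete, here is the geometry I'm assuming (the
helper and the check are mine, nothing the driver exposes): with three drives
and 256K chunks, one full data stripe is (3-1)*256K = 512K, so an aligned 1MB
write covers exactly two full stripes and should need no reads at all.

    CHUNK = 256 * 1024                    # mdadm chunk size
    NDISKS = 3                            # members of the RAID-5 set
    DATA_STRIPE = (NDISKS - 1) * CHUNK    # data per full stripe: 512K here

    def is_full_stripe_write(offset, length, data_stripe=DATA_STRIPE):
        """True if a write covers only whole stripes, so parity can be
        computed from the new data alone and no readback is needed."""
        return offset % data_stripe == 0 and length % data_stripe == 0

    print(is_full_stripe_write(0, 1 << 20))     # True: two full stripes
    print(is_full_stripe_write(0, 256 * 1024))  # False: half a stripe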
>
>> I'm doing the tests writing 2GB of data to the raw array, in 1MB
>> writes. The array is RAID-5 with 256 chunk size. I wouldn't really
>> expect any reads,
>
> but how many disks? if your 1M writes are to 4 data disks, you stand
> a chance of streaming (assuming your writes are naturally aligned, or
> else you'll be somewhat dependent on the stripe cache.)
> in other words, your whole-stripe size is ndisks*chunksize, and for
> 256K chunks and, say, 14 disks, that's pretty monstrous...
Three drives, so they could be totally isolated from other i/o.
>
> I think that's a factor often overlooked - large chunk sizes, especially
> with r5/6 AND lots of disks, mean you probably won't ever do "blind"
> updates, and thus need the r/m/w cycle.  in that case, if the stripe
> cache is not big/smart enough, you'll be limited by reads.
I didn't have lots of disks, and when the data and parity are all being
updated in full chunk increments, there's no reason for a read, since
the data won't be needed. I agree that it's probably being read, but
needlessly.
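The way I picture it (my own model, not a claim about the exact md code path),
there are two ways to keep parity correct on a partial write: read-modify-write
(read the old data and old parity for the chunks being replaced) or
reconstruct-write (read the untouched data chunks and recompute parity from the
whole stripe). A full-stripe update needs zero reads under either scheme, which
is why the reads look needless here:

    def parity_update_reads(ndisks, chunks_written):
        """Reads needed to update parity in one RAID-5 stripe when the write
        replaces chunks_written of the (ndisks - 1) data chunks.
        Illustrative model only."""
        data_chunks = ndisks - 1
        rmw = chunks_written + 1 if chunks_written < data_chunks else 0
        rcw = data_chunks - chunks_written
        return min(rmw, rcw)

    print(parity_update_reads(3, 2))   # full-stripe write on 3 disks: 0 reads
    print(parity_update_reads(3, 1))   # half-stripe write: 1 read
    print(parity_update_reads(14, 2))  # wide array, small write: 3 reads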
>
> I'd like to experiment with this, to see how much benefit you really
> get from using larger chunk sizes.  I'm guessing that past 32K
> or so, normal *ata systems don't speed up much.  fabrics with higher
> latency or command/arbitration overhead would want larger chunks.
>
>> tried was 2K blocks, so I can try other sizes. I have a hard time
>> picturing why smaller sizes would be better, but that's what testing
>> is for.
>
> larger writes (from user-space) generally help, probably up to MB's.
> smaller chunks help by making it more likely to do blind parity updates;
> a larger stripe cache can help that too.
I tried write sizes from 256B to 1MB; 1MB was best, or more correctly the
least unacceptable.
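In case anyone wants to reproduce the numbers, this is roughly the shape of
the test (a sketch only; the device path is an example, and with O_DIRECT the
write size has to be a multiple of the device sector size, so the very small
sizes need buffered I/O instead):

    import mmap, os, time

    def write_speed(dev, total, write_size):
        """Write `total` bytes to `dev` in `write_size` chunks, return MB/s.
        An anonymous mmap gives the page-aligned buffer O_DIRECT wants."""
        buf = mmap.mmap(-1, write_size)
        buf.write(b"\xa5" * write_size)
        fd = os.open(dev, os.O_WRONLY | os.O_DIRECT)
        try:
            start = time.monotonic()
            done = 0
            while done < total:
                done += os.write(fd, buf)
            elapsed = time.monotonic() - start
        finally:
            os.close(fd)
        return (total / 2**20) / elapsed

    # e.g. 2GB in 1MB writes straight to the array device:
    # print(write_speed("/dev/md0", 2 * 2**30, 1 << 20))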
>
> I think I recall an earlier thread regarding how the stripe cache is used
> somewhat naively - that all IO goes through it. the most important
> blocks would be parity and "ends" of a write that partially update an
> underlying chunk. (conversely, don't bother caching anything which
> can be blindly written to disk.)
I fear that last parenthetical isn't being observed.
If it weren't for RAID-1 and RAID-10 being fast, I wouldn't complain about
RAID-5.
--
bill davidsen <davidsen@tmr.com>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979