From: Bill Davidsen <davidsen@tmr.com>
To: Dan Williams <dan.j.williams@gmail.com>
Cc: Roger Lucas <roger@planbit.co.uk>,
	linux-raid@vger.kernel.org, neilb@suse.de
Subject: Re: Odd (slow) RAID performance
Date: Thu, 07 Dec 2006 10:51:25 -0500
Message-ID: <4578387D.4010209@tmr.com>
In-Reply-To: <e9c3a7c20612041733h5999ac0erc7ab4a7174882683@mail.gmail.com>

Dan Williams wrote:
> On 12/1/06, Bill Davidsen <davidsen@tmr.com> wrote:
>> Thank you so much for verifying this. I do keep enough room on my drives
>> to run tests by creating any kind of whatever I need, but the point is
>> clear: with N drives striped the transfer rate is N x base rate of one
>> drive; with RAID-5 it is about the speed of one drive, suggesting that
>> the md code serializes writes.
>>
>> If true, BOO, HISS!
>>
>> Can you explain and educate us, Neil? This looks like terrible
>> performance.
>>
> Just curious what is your stripe_cache_size setting in sysfs?
> 
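
For anyone checking their own setup: the stripe cache can be read and
resized through sysfs; the value counts cache entries, each holding one
page per member disk. A minimal sketch, assuming the array is md0:

  # show the current stripe cache size
  cat /sys/block/md0/md/stripe_cache_size
  # enlarge it to 4096 entries (as root)
  echo 4096 > /sys/block/md0/md/stripe_cache_size
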
> Neil, please include me in the education if what follows is incorrect:
> 
> Read performance in kernels up to and including 2.6.19 is hindered by
> needing to go through the stripe cache.  This situation should improve
> with the stripe-cache-bypass patches currently in -mm.  As Raz
> reported, in some cases the performance increase from this approach is
> 30%, which is roughly the performance difference I see between a
> 4-disk raid5 and a 3-disk raid0.
> 
> For the write case I can say that MD does not serialize writes, if by
> serialize you mean a 1:1 correlation between writes to the parity
> disk and writes to a data disk.  To illustrate, I instrumented
> MD to count how many times it issued a write to the parity disk and
> compared that to how many writes it performed to the member disks for
> the workload "dd if=/dev/zero of=/dev/md0 bs=1024k count=100".  I
> recorded 8544 parity writes and 25600 member disk writes which is
> about 3 member disk writes per parity write, or pretty close to
> optimal for a 4-disk array.  So serialization is not the cause, and
> sub-stripe-width writes are not the cause either, since >98% of the
> writes happened without needing to read old data from the disks.
> However, I see the same performance on my system, about equal to a
> single disk.
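
The arithmetic behind those counts works out, assuming 4 KiB pages:

  data writes  : 100 MiB / 4 KiB      = 25600 page writes
  ideal parity : 25600 / 3 data disks = ~8533 writes
  observed     : 8544 parity writes, within ~0.2% of ideal, i.e.
                 almost every write was a full-stripe write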

But the number of writes isn't an indication of serialization. If I 
write disk A, then B, then C, then D, you can't tell if I waited for 
each write to finish before starting the next, or did them in parallel. 
And since the write speed is equal to the speed of a single drive, 
effectively that's what happens, even though I can't see it in the code.
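
One way to test that directly is to compare md's write speed against
the same drives written in parallel. A minimal sketch, assuming spare
scratch partitions sd[abcd]5 exist -- never point this at live array
members:

  # write all four drives at once; if the aggregate here is ~4x a
  # single drive while raid5 delivers ~1x, the stall is above the disks
  for p in /dev/sd[abcd]5; do
      dd if=/dev/zero of=$p bs=1024k count=1000 &
  done
  wait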

I also suspect that writes are not being combined, since writing the
2GB test runs at one-drive speed with 1MB blocks, but at floppy speed
with 2k blocks. And no, I'm not running out of CPU doing the overhead:
usage jumps from 2-4% to 30% of one CPU, but on an unloaded SMP system
that's not CPU bound.
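
Whether the block layer is merging those 2k writes should be visible in
sysstat's iostat: wrqm/s counts write requests merged per second, and
avgrq-sz shows the average issued request size in 512-byte sectors
(device names here are examples):

  # sample the member disks once a second while the test runs
  iostat -x sda sdb sdc sdd 1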
> 
> Here is where I step into supposition territory.  Perhaps the
> discrepancy is related to the size of the requests going to the block
> layer.  raid5 always makes page sized requests with the expectation
> that they will coalesce into larger requests in the block layer.
> Maybe we are missing coalescing opportunities in raid5 compared to
> what happens in the raid0 case?  Are there any io scheduler knobs to
> turn along these lines?
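
For reference, the per-device scheduler and queue knobs live under
sysfs (device names are examples; the writes need root):

  # list schedulers; the active one is shown in brackets
  cat /sys/block/sda/queue/scheduler
  # switch to, say, deadline
  echo deadline > /sys/block/sda/queue/scheduler
  # queue depth and the largest request the queue will build
  cat /sys/block/sda/queue/nr_requests
  cat /sys/block/sda/queue/max_sectors_kb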

Good thought; I had already tried that but not reported it. Changing
schedulers makes no significant difference, only in the range of 2-3%,
which is within the measurement jitter due to head position or
whatever.

I changed my swap to RAID-10, but RAID-5 just can't keep up with the
70-100MB/s data bursts I need. I'm probably going to scrap software
RAID and go back to a hardware controller; the write speeds are simply
not even close to what they should be. I have one more thing to try, a
tool I wrote to chase another problem a few years ago. I'll report if
I find something.
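
For anyone who wants to replicate the swap-on-RAID-10 arrangement, a
minimal sketch; the partitions sd[abcd]2 are assumptions, and mdadm's
default layout and chunk size are left alone:

  mdadm --create /dev/md1 --level=10 --raid-devices=4 /dev/sd[abcd]2
  mkswap /dev/md1
  swapon -p 10 /dev/md1   # give it priority over any leftover swap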

-- 
bill davidsen <davidsen@tmr.com>
   CTO TMR Associates, Inc
   Doing interesting things with small computers since 1979
