* RAID-5 streaming read performance
From: Dan Christensen @ 2005-07-11 15:11 UTC (permalink / raw)
To: linux-raid
I was wondering what I should expect in terms of streaming read
performance when using (software) RAID-5 with four SATA drives. I
thought I would get a noticeable improvement compared to reads from a
single device, but that's not the case. I tested this by using dd to
read 300MB directly from disk partitions /dev/sda7, etc, and also using
dd to read 300MB directly from the raid device (/dev/md2 in this case).
I get around 57MB/s from each of the disk partitions that make up the
raid device, and about 58MB/s from the raid device. On the other
hand, if I run parallel reads from the component partitions, I get
25 to 30MB/s each, so the bus can clearly achieve more than 100MB/s.
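
A rough sketch of the kind of parallel read described here (the exact commands
Dan used are not shown in the thread; the partition names are the ones listed
under "System" below):

# Read 300MB from each component partition at the same time and report
# the per-device rates.  Illustrative only - not the original test.
for f in sda7 sdb5 sdc5 sdd5 ; do
    dd if=/dev/$f of=/dev/null bs=1M count=300 2>&1 | grep bytes/sec &
done
wait
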
Before each read, I try to clear the kernel's cache by reading
900MB from an unrelated partition on the disk. (Is this guaranteed
to work? Is there a better and/or faster way to clear cache?)
I have AAM quiet mode/low performance enabled on the drives, but (a)
this shouldn't matter too much for streaming reads, and (b) it's the
relative performance of the reads from the partitions and the RAID
device that I'm curious about.
I also get poor write performance, but that's harder to isolate
because I have to go through the lvm and filesystem layers too.
I also get poor performance from my RAID-1 array and my other
RAID-5 arrays.
Details of my tests and set-up below.
Thanks for any suggestions,
Dan
System:
- Athlon 2500+
- kernel 2.6.12.2 (also tried 2.6.11.11)
- four SATA drives (3 160G, 1 200G); Samsung Spinpoint
- SiI3114 controller (latency_timer=32 by default; tried 128 too)
- 1G ram
- blockdev --getra /dev/sda --> 256 (didn't play with these; see the sketch after this list)
- blockdev --getra /dev/md2 --> 768 (didn't play with this)
- tried anticipatory, deadline and cfq schedulers, with no significant
difference.
- machine essentially idle during tests
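
A small sketch of how the readahead and latency_timer values above can be
inspected and changed.  blockdev is used elsewhere in this thread; lspci and
setpci (pciutils) are not mentioned in it, and the PCI address is a placeholder:

# Illustrative only - find the real controller address with lspci first.
blockdev --getra /dev/sda /dev/md2      # readahead, in 512-byte sectors
blockdev --setra 1024 /dev/md2          # try a larger readahead on the array
lspci | grep -i 3114                    # locate the SiI3114 controller
setpci -s 00:0f.0 latency_timer=80      # 0x80 = 128; the default here was 32
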
Here is part of /proc/mdstat (the full output is below):
md2 : active raid5 sdd5[3] sdc5[2] sdb5[1] sda7[0]
218612160 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
Here's the test script and output:
# Clear cache:
dd if=/dev/sda8 of=/dev/null bs=1M count=900 > /dev/null 2>&1
for f in sda7 sdb5 sdc5 sdd5 ; do
echo $f
dd if=/dev/$f of=/dev/null bs=1M count=300 2>&1 | grep bytes/sec
echo
done
# Clear cache:
dd if=/dev/sda8 of=/dev/null bs=1M count=900 > /dev/null 2>&1
for f in md2 ; do
echo $f
dd if=/dev/$f of=/dev/null bs=1M count=300 2>&1 | grep bytes/sec
echo
done
Output:
sda7
314572800 bytes transferred in 5.401071 seconds (58242671 bytes/sec)
sdb5
314572800 bytes transferred in 5.621170 seconds (55962158 bytes/sec)
sdc5
314572800 bytes transferred in 5.635491 seconds (55819947 bytes/sec)
sdd5
314572800 bytes transferred in 5.333374 seconds (58981951 bytes/sec)
md2
314572800 bytes transferred in 5.386627 seconds (58398846 bytes/sec)
# cat /proc/mdstat
md1 : active raid5 sdd1[2] sdc1[1] sda2[0]
578048 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
md4 : active raid5 sdd2[3] sdc2[2] sdb2[1] sda6[0]
30748032 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
md2 : active raid5 sdd5[3] sdc5[2] sdb5[1] sda7[0]
218612160 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
md3 : active raid5 sdd6[3] sdc6[2] sdb6[1] sda8[0]
218636160 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
md0 : active raid1 sdb1[0] sda5[1]
289024 blocks [2/2] [UU]
# mdadm --detail /dev/md2
/dev/md2:
Version : 00.90.01
Creation Time : Mon Jul 4 23:54:34 2005
Raid Level : raid5
Array Size : 218612160 (208.48 GiB 223.86 GB)
Device Size : 72870720 (69.49 GiB 74.62 GB)
Raid Devices : 4
Total Devices : 4
Preferred Minor : 2
Persistence : Superblock is persistent
Update Time : Thu Jul 7 21:52:50 2005
State : clean
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 64K
UUID : c4056d19:7b4bb550:44925b88:91d5bc8a
Events : 0.10873823
Number Major Minor RaidDevice State
0 8 7 0 active sync /dev/sda7
1 8 21 1 active sync /dev/sdb5
2 8 37 2 active sync /dev/sdc5
3 8 53 3 active sync /dev/sdd5

* Re: RAID-5 streaming read performance
From: Ming Zhang @ 2005-07-13  2:08 UTC (permalink / raw)
To: Dan Christensen; +Cc: Linux RAID

On Mon, 2005-07-11 at 11:11 -0400, Dan Christensen wrote:
> I was wondering what I should expect in terms of streaming read
> performance when using (software) RAID-5 with four SATA drives.  I
> thought I would get a noticeable improvement compared to reads from a
> single device, but that's not the case.
>
> [...]
>
> System:
> - Athlon 2500+
> - kernel 2.6.12.2 (also tried 2.6.11.11)
> - four SATA drives (3 160G, 1 200G); Samsung Spinpoint
> - SiI3114 controller (latency_timer=32 by default; tried 128 too)

only 1 card, 4 ports? try some other brand of card, and try using
several cards at the same time. i have met some poor cards before.

> - 1G ram
> - blockdev --getra /dev/sda --> 256 (didn't play with these)
> - blockdev --getra /dev/md2 --> 768 (didn't play with this)
> - tried anticipatory, deadline and cfq schedulers, with no significant
>   difference.
> - machine essentially idle during tests
>
> [...]

* Re: RAID-5 streaming read performance
From: Dan Christensen @ 2005-07-13  2:52 UTC (permalink / raw)
To: mingz; +Cc: Linux RAID

Ming Zhang <mingz@ele.uri.edu> writes:

> On Mon, 2005-07-11 at 11:11 -0400, Dan Christensen wrote:
>> [...]
>> - SiI3114 controller (latency_timer=32 by default; tried 128 too)
>
> only 1 card, 4 ports? try some other brand of card, and try using
> several cards at the same time. i have met some poor cards before.

Yes, one 4-port controller.  It's on the motherboard.

I thought that since I get good throughput doing parallel reads from
the four drives (see above) that would eliminate the controller as the
bottleneck.  Am I wrong?

Dan

* Re: RAID-5 streaming read performance
From: berk walker @ 2005-07-13  3:15 UTC (permalink / raw)
To: Dan Christensen; +Cc: mingz, Linux RAID

Dan Christensen wrote:

>Ming Zhang <mingz@ele.uri.edu> writes:
>
>>only 1 card, 4 ports? try some other brand of card, and try using
>>several cards at the same time. i have met some poor cards before.
>
>Yes, one 4-port controller.  It's on the motherboard.
>
>I thought that since I get good throughput doing parallel reads from
>the four drives (see above) that would eliminate the controller as the
>bottleneck.  Am I wrong?
>
>Dan

Slavery was abolished in the 1800's.

b-

* Re: RAID-5 streaming read performance
From: Ming Zhang @ 2005-07-13 12:24 UTC (permalink / raw)
To: Dan Christensen; +Cc: Linux RAID

On Tue, 2005-07-12 at 22:52 -0400, Dan Christensen wrote:
> Yes, one 4-port controller.  It's on the motherboard.
>
> I thought that since I get good throughput doing parallel reads from
> the four drives (see above) that would eliminate the controller as the
> bottleneck.  Am I wrong?

have u tried parallel writes?

* Re: RAID-5 streaming read performance
From: Dan Christensen @ 2005-07-13 12:48 UTC (permalink / raw)
To: mingz; +Cc: Linux RAID

Ming Zhang <mingz@ele.uri.edu> writes:

> have u tried parallel writes?

I haven't tested it as thoroughly, as it brings lvm and the filesystem
into the mix.  (The disks are in "production" use, and are fairly
full, so I can't do writes directly to the disk partitions/raid
device.)

My preliminary finding is that raid writes are faster than non-raid
writes: 49MB/s vs 39MB/s.  Still not stellar performance, though.

Question for the list: if I'm doing a long sequential write, naively
each parity block will get recalculated and rewritten several times,
once for each non-parity block in the stripe.  Does the write-caching
that the kernel does mean that each parity block will only get written
once?

Dan

* Re: RAID-5 streaming read performance
From: Ming Zhang @ 2005-07-13 12:52 UTC (permalink / raw)
To: Dan Christensen; +Cc: Linux RAID

On Wed, 2005-07-13 at 08:48 -0400, Dan Christensen wrote:
> I haven't tested it as thoroughly, as it brings lvm and the filesystem
> into the mix.  (The disks are in "production" use, and are fairly
> full, so I can't do writes directly to the disk partitions/raid
> device.)

testing on a production environment is too dangerous. :P and many
benchmark tools u can not run there either.

LVM overhead is small, but file system overhead is hard to say.

> Question for the list: if I'm doing a long sequential write, naively
> each parity block will get recalculated and rewritten several times,
> once for each non-parity block in the stripe.  Does the write-caching
> that the kernel does mean that each parity block will only get written
> once?

if you write sequentially, you might see full-stripe writes and thus
each parity block written only once. but if you write through a file
system, and the file system does metadata writes and log writes, then
things become more complicated.

you can use iostat to see the reads and writes hitting your disks.
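
A minimal sketch of the iostat check suggested above (sysstat's iostat; the
1-second interval and options are illustrative, not taken from the thread):

# Collect extended per-device stats while a streaming read runs.
iostat -x 1 > iostat.log &
IOSTAT=$!
dd if=/dev/md2 of=/dev/null bs=1M count=300
kill $IOSTAT
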

* Re: RAID-5 streaming read performance
From: Dan Christensen @ 2005-07-13 14:23 UTC (permalink / raw)
To: mingz; +Cc: Linux RAID

Ming Zhang <mingz@ele.uri.edu> writes:

> testing on a production environment is too dangerous. :P and many
> benchmark tools u can not run there either.

Well, I put "production" in quotes because this is just a home mythtv
box.  :-)  So there are plenty of times when it is idle and I can do
benchmarks.  But I can't erase the hard drives in my tests.

> LVM overhead is small, but file system overhead is hard to say.

I expected LVM overhead to be small, but in my tests it is very high.
I plan to discuss this on the lvm mailing list after I've got the RAID
working as well as possible, but as an example:

Streaming reads using dd to /dev/null:

component partitions, e.g. /dev/sda7: 58MB/s
raid device /dev/md2: 59MB/s
lvm device /dev/main/media: 34MB/s

So something is seriously wrong with my lvm set-up (or with lvm).  The
lvm device is linearly mapped to the initial blocks of md2, so the
last two tests should be reading the same blocks from disk.

> if you write sequentially, you might see full-stripe writes and thus
> each parity block written only once.

Glad to hear it.  In that case, sequential writes to a RAID-5 device
with 4 physical drives should be up to 3 times faster than writes to a
single device (ignoring journaling, time for calculating parity, bus
bandwidth issues, etc).

Is this "stripe write" something that the md layer does to optimize
things?  In other words, does the md layer cache writes and write a
stripe at a time when that's possible?  Or is this just an automatic
effect of the general purpose write-caching that the kernel does?

> but if you write through a file system, and the file system does
> metadata writes and log writes, then things become more complicated.

Yes.  For now I'm starting at the bottom and working up...

> you can use iostat to see the reads and writes hitting your disks.

Thanks, I'll try that.

Dan

* Re: RAID-5 streaming read performance
From: Ming Zhang @ 2005-07-13 14:29 UTC (permalink / raw)
To: Dan Christensen; +Cc: Linux RAID

On Wed, 2005-07-13 at 10:23 -0400, Dan Christensen wrote:
> Streaming reads using dd to /dev/null:
>
> component partitions, e.g. /dev/sda7: 58MB/s
> raid device /dev/md2: 59MB/s
> lvm device /dev/main/media: 34MB/s
>
> So something is seriously wrong with my lvm set-up (or with lvm).  The
> lvm device is linearly mapped to the initial blocks of md2, so the
> last two tests should be reading the same blocks from disk.

this is interesting.

> Glad to hear it.  In that case, sequential writes to a RAID-5 device
> with 4 physical drives should be up to 3 times faster than writes to a
> single device (ignoring journaling, time for calculating parity, bus
> bandwidth issues, etc).

sounds reasonable, but hard to see in practice i feel.

> Is this "stripe write" something that the md layer does to optimize
> things?  In other words, does the md layer cache writes and write a
> stripe at a time when that's possible?  Or is this just an automatic
> effect of the general purpose write-caching that the kernel does?

md people will give you more details. :)

* Re: RAID-5 streaming read performance
From: Dan Christensen @ 2005-07-13 17:56 UTC (permalink / raw)
To: linux-raid

Here's a question for people running software raid-5: do you get
significantly better read speed from a raid-5 device than from its
component partitions/hard drives, using the simple dd test I did?

Knowing this will help determine whether something is funny with my
set-up and/or hardware, or if I just had unrealistic expectations about
software raid performance.

Feel free to reply directly to me if you don't want to clutter the
list.  My dumb script is below.

Thanks,

Dan

#!/bin/sh

dd if=/dev/sda8 of=/dev/null bs=1M count=900 > /dev/null 2>&1
for f in sda7 sdb5 sdc5 sdd5 ; do
    echo $f; dd if=/dev/$f of=/dev/null bs=1M count=300 2>&1 | grep bytes/sec
    echo; done

dd if=/dev/sda8 of=/dev/null bs=1M count=900 > /dev/null 2>&1
for f in md2 ; do
    echo $f; dd if=/dev/$f of=/dev/null bs=1M count=300 2>&1 | grep bytes/sec
    echo; done

* Re: RAID-5 streaming read performance
From: Neil Brown @ 2005-07-13 22:38 UTC (permalink / raw)
To: Dan Christensen; +Cc: linux-raid

On Wednesday July 13, jdc@uwo.ca wrote:
> Here's a question for people running software raid-5: do you get
> significantly better read speed from a raid-5 device than from its
> component partitions/hard drives, using the simple dd test I did?

SCSI-160 bus, using just 4 of the 15000rpm drives:

  each drive by itself delivers about 67M/s
  Three drives in parallel deliver 40M/s each, total of 120M/s
  4 give 30M/s each or a total of 120M/s

  raid5 over 4 drives delivers 132M/s

(We've just ordered a SCSI-320 card to make better use of the drives).

So with top-quality (and price) hardware, it seems to do the right
thing.

NeilBrown

* Re: RAID-5 streaming read performance
From: Ming Zhang @ 2005-07-14  0:09 UTC (permalink / raw)
To: Neil Brown; +Cc: Dan Christensen, Linux RAID

On Thu, 2005-07-14 at 08:38 +1000, Neil Brown wrote:
> SCSI-160 bus, using just 4 of the 15000rpm drives:
>
>   each drive by itself delivers about 67M/s
>   Three drives in parallel deliver 40M/s each, total of 120M/s
>   4 give 30M/s each or a total of 120M/s
>
>   raid5 over 4 drives delivers 132M/s

why 132MB/s here instead of the 120MB/s (3 * 40MB/s) u mentioned? any
factor that leads to this increase?

* Re: RAID-5 streaming read performance
From: Neil Brown @ 2005-07-14  1:16 UTC (permalink / raw)
To: mingz; +Cc: Dan Christensen, Linux RAID

On Wednesday July 13, mingz@ele.uri.edu wrote:
> why 132MB/s here instead of the 120MB/s (3 * 40MB/s) u mentioned? any
> factor that leads to this increase?

I did another test over 10 times the amount of data, and got 34M/s for
4 concurrent individual drives, which multiplies out to 136M/s.  The
same amount of data off the raid5 gives 137M/s, so I think it was just
experimental error.

NeilBrown

* Re: RAID-5 streaming read performance
From: Ming Zhang @ 2005-07-14  1:25 UTC (permalink / raw)
To: Neil Brown; +Cc: Dan Christensen, Linux RAID

On Thu, 2005-07-14 at 11:16 +1000, Neil Brown wrote:
> I did another test over 10 times the amount of data, and got 34M/s for
> 4 concurrent individual drives, which multiplies out to 136M/s.  The
> same amount of data off the raid5 gives 137M/s, so I think it was just
> experimental error.

ic. thanks for the explanation. yes, agree. it seems that u can get
near-linear performance with decent SCSI HW, while what we can get from
SATA is not as good. :P

Ming

* Re: RAID-5 streaming read performance
From: David Greaves @ 2005-07-13 18:02 UTC (permalink / raw)
To: Dan Christensen; +Cc: mingz, Linux RAID

Dan Christensen wrote:

>Well, I put "production" in quotes because this is just a home mythtv
>box.  :-)  So there are plenty of times when it is idle and I can do
>benchmarks.  But I can't erase the hard drives in my tests.

Me too.

>I expected LVM overhead to be small, but in my tests it is very high.
>I plan to discuss this on the lvm mailing list after I've got the RAID
>working as well as possible, but as an example:
>
>Streaming reads using dd to /dev/null:
>
>component partitions, e.g. /dev/sda7: 58MB/s
>raid device /dev/md2: 59MB/s
>lvm device /dev/main/media: 34MB/s

This is not my experience.
What are the readahead settings?
I found significant variation in performance by varying the readahead at
raw, md and lvm device level.

In my setup I get

component partitions, e.g. /dev/sda7: 39MB/s
raid device /dev/md2: 31MB/s
lvm device /dev/main/media: 53MB/s

(oldish system - but note that lvm device is *much* faster)

For your entertainment you may like to try this to 'tune' your readahead
- it's OK to use so long as you're not recording:

(FYI I find that setting readahead to 0 on all devices and 4096 on the
lvm device gets me the best performance - which makes sense if you think
about it...)

#!/bin/bash
RAW_DEVS="/dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/hdb"
MD_DEVS=/dev/md0
LV_DEVS=/dev/huge_vg/huge_lv

LV_RAS="0 128 256 1024 4096 8192"
MD_RAS="0 128 256 1024 4096 8192"
RAW_RAS="0 128 256 1024 4096 8192"

function show_ra()
{
    for i in $RAW_DEVS $MD_DEVS $LV_DEVS
    do echo -n "$i `blockdev --getra $i` :: "
    done
    echo
}

function set_ra()
{
    RA=$1
    shift
    for dev in $@
    do
        blockdev --setra $RA $dev
    done
}

function show_performance()
{
    COUNT=4000000
    dd if=$LV_DEVS of=/dev/null count=$COUNT 2>&1 | grep seconds
}

for RAW_RA in $RAW_RAS
do
    set_ra $RAW_RA $RAW_DEVS
    for MD_RA in $MD_RAS
    do
        set_ra $MD_RA $MD_DEVS
        for LV_RA in $LV_RAS
        do
            set_ra $LV_RA $LV_DEVS
            show_ra
            show_performance
        done
    done
done
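
If the sweep above confirms David's observation (readahead 0 on the raw and md
devices, 4096 on the lvm device), the winning combination could be applied with
something like the following sketch - device names as in his script, readahead
values in blockdev's 512-byte-sector units:

# Apply the combination David reports as fastest on his system:
# no readahead on raw and md devices, 4096 sectors (2MB) on the LV.
for dev in /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/hdb /dev/md0 ; do
    blockdev --setra 0 $dev
done
blockdev --setra 4096 /dev/huge_vg/huge_lv
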

* Re: RAID-5 streaming read performance
From: Ming Zhang @ 2005-07-13 18:14 UTC (permalink / raw)
To: David Greaves; +Cc: Dan Christensen, Linux RAID

On Wed, 2005-07-13 at 19:02 +0100, David Greaves wrote:
> In my setup I get
>
> component partitions, e.g. /dev/sda7: 39MB/s
> raid device /dev/md2: 31MB/s
> lvm device /dev/main/media: 53MB/s
>
> (oldish system - but note that lvm device is *much* faster)

this is so interesting to see! seems that some readahead parameters
have a negative impact.

> [...]

* Re: RAID-5 streaming read performance
From: David Greaves @ 2005-07-13 21:18 UTC (permalink / raw)
To: mingz; +Cc: Dan Christensen, Linux RAID

Ming Zhang wrote:

>>component partitions, e.g. /dev/sda7: 39MB/s
>>raid device /dev/md2: 31MB/s
>>lvm device /dev/main/media: 53MB/s
>>
>>(oldish system - but note that lvm device is *much* faster)
>
>this is so interesting to see! seems that some readahead parameters
>have a negative impact.

I guess each raw device does some readahead, then md0 does some
readahead, and then lvm does some readahead.  Theoretically the md0
and lvm readahead should overlap - but I guess that much of the raw
device level readahead is discarded.

David

* Re: RAID-5 streaming read performance
From: Ming Zhang @ 2005-07-13 21:44 UTC (permalink / raw)
To: David Greaves; +Cc: Dan Christensen, Linux RAID

On Wed, 2005-07-13 at 22:18 +0100, David Greaves wrote:
> I guess each raw device does some readahead, then md0 does some
> readahead, and then lvm does some readahead.  Theoretically the md0
> and lvm readahead should overlap - but I guess that much of the raw
> device level readahead is discarded.

for a streaming read, what you read ahead now will always be used
exactly once in the near future. at least i think raw device readahead
could be turned on at the same time as readahead on one of the OS
components, raid or lvm. but in your case, u get the best result when
only one is turned on.

ming

* Re: RAID-5 streaming read performance
From: David Greaves @ 2005-07-13 21:50 UTC (permalink / raw)
To: mingz; +Cc: Dan Christensen, Linux RAID

Ming Zhang wrote:

>for a streaming read, what you read ahead now will always be used
>exactly once in the near future. at least i think raw device readahead
>could be turned on at the same time as readahead on one of the OS
>components, raid or lvm. but in your case, u get the best result when
>only one is turned on.

I doubt it's just me - what results do others get with that script?

David

* Re: RAID-5 streaming read performance
From: Ming Zhang @ 2005-07-13 21:55 UTC (permalink / raw)
To: David Greaves; +Cc: Dan Christensen, Linux RAID

On Wed, 2005-07-13 at 22:50 +0100, David Greaves wrote:
> I doubt it's just me - what results do others get with that script?

my box is in use now. i might try it tomorrow to see what happens. :P

Ming

* Re: RAID-5 streaming read performance
From: Neil Brown @ 2005-07-13 22:52 UTC (permalink / raw)
To: David Greaves; +Cc: mingz, Dan Christensen, Linux RAID

On Wednesday July 13, david@dgreaves.com wrote:
> I guess each raw device does some readahead, then md0 does some
> readahead, and then lvm does some readahead.  Theoretically the md0
> and lvm readahead should overlap - but I guess that much of the raw
> device level readahead is discarded.

No.  Devices don't do readahead (well, modern drives may well
read-ahead into an on-drive buffer, but that is completely transparent
and separate from any readahead that linux does).

Each device just declares how much readahead it thinks is appropriate
for that device.  The linux mm layer does read-ahead by requesting
from devices blocks that haven't actually been asked for by upper
layers.  The amount of readahead depends on the behaviour of the app
doing the reads, and the setting declared by the devices.

raid5 declares a read-ahead size of twice the stripe size,
i.e. chunk size * (disks-1) * 2.  Possibly it should make it bigger if
the underlying devices would all be happy with that; however I haven't
given the issue a lot of thought, and it is tunable from userspace.

NeilBrown
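
As a quick sanity check, that formula matches the md2 readahead reported at the
start of the thread (64k chunks, 4 disks, readahead reported by blockdev in
512-byte sectors):

# chunk size * (disks-1) * 2 = 64KiB * 3 * 2 = 384KiB = 768 sectors,
# matching "blockdev --getra /dev/md2 --> 768" from the original post.
echo $(( (64 * 1024) * (4 - 1) * 2 / 512 ))     # prints 768
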

* Re: RAID-5 streaming read performance
From: Dan Christensen @ 2005-07-14  3:58 UTC (permalink / raw)
To: David Greaves, Linux RAID; +Cc: mingz

David Greaves <david@dgreaves.com> writes:

> In my setup I get
>
> component partitions, e.g. /dev/sda7: 39MB/s
> raid device /dev/md2: 31MB/s
> lvm device /dev/main/media: 53MB/s
>
> (oldish system - but note that lvm device is *much* faster)

Did you test component device and raid device speed using the
read-ahead settings tuned for lvm reads?  If so, that's not a fair
comparison.  :-)

> For your entertainment you may like to try this to 'tune' your readahead
> - it's OK to use so long as you're not recording:

Thanks, I played around with that a lot.  I tuned readahead to
optimize lvm device reads, and this improved things greatly.  It turns
out the default lvm settings had readahead set to 0!  But by tuning
things, I could get my read speed up to 59MB/s.  This is with raw
device readahead 256, md device readahead 1024 and lvm readahead 2048.
(The speed was most sensitive to the last one, but did seem to depend
on the other ones a bit too.)

I separately tuned the raid device read speed.  To maximize this, I
needed to set the raw device readahead to 1024 and the raid device
readahead to 4096.  This brought my raid read speed from 59MB/s to
78MB/s.  Better!  (But note that now this makes the lvm read speed
look bad.)

My raw device read speed is independent of the readahead setting, as
long as it is at least 256.  The speed is about 58MB/s.

Summary:

raw device:  58MB/s
raid device: 78MB/s
lvm device:  59MB/s

raid still isn't achieving the 106MB/s that I can get with parallel
direct reads, but at least it's getting closer.

As a simple test, I wrote a program like dd that reads and discards
64k chunks of data from a device, but which skips 1 out of every four
chunks (simulating skipping parity blocks).  It's not surprising that
this program can only read from a raw device at about 75% the rate of
dd, since the kernel readahead is probably causing the skipped blocks
to be read anyways (or maybe because the disk head has to pass over
those sections of the disk anyways).

I then ran four copies of this program in parallel, reading from the
raw devices that make up my raid partition.  And, like md, they only
achieved about 78MB/s.  This is very close to 75% of 106MB/s.  Again,
not surprising, since I need to have raw device readahead turned on
for this to be efficient at all, so 25% of the chunks that pass
through the controller are ignored.

But I still don't understand why the md layer can't do better.  If I
turn off readahead of the raw devices, and keep it for the raid
device, then parity blocks should never be requested, so they
shouldn't use any bus/controller bandwidth.  And even if each drive is
only acting at 75% efficiency, the four drives should still be able to
saturate the bus/controller.  So I can't figure out what's going on
here.

Is there a way for me to simulate readahead in userspace, i.e. can
I do lots of sequential asynchronous reads in parallel?

Also, is there a way to disable caching of reads?  Having to clear
the cache by reading 900M each time slows down testing.  I guess I
could reboot with mem=100M, but it'd be nice to disable/enable caching
on the fly.  Hmm, maybe I can just run something like memtest which
locks a bunch of ram...

Thanks for all of the help so far!

Dan
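
Dan's skip-one-chunk-in-four reader is not posted in the thread; a very crude
stand-in can be approximated with dd in a loop.  Per-invocation overhead makes
it much slower and readahead behaves differently, so treat it only as an
illustration of the access pattern:

# Read 3 chunks out of every group of 4 (64k chunks), roughly mimicking
# a RAID-5 reader skipping the parity chunk on one component.
DEV=/dev/sda7
for i in $(seq 0 1199); do
    dd if=$DEV of=/dev/null bs=64k count=3 skip=$((i * 4)) 2>/dev/null
done
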

* Re: RAID-5 streaming read performance
From: Mark Hahn @ 2005-07-14  4:13 UTC (permalink / raw)
To: Dan Christensen; +Cc: Linux RAID

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1321 bytes --]

> > component partitions, e.g. /dev/sda7: 39MB/s
> > raid device /dev/md2: 31MB/s
> > lvm device /dev/main/media: 53MB/s
> >
> > (oldish system - but note that lvm device is *much* faster)
>
> Did you test component device and raid device speed using the
> read-ahead settings tuned for lvm reads?  If so, that's not a fair
> comparison.  :-)

I did an eval with a vendor who claimed that their lvm actually
improved bandwidth because it somehow triggered better full-stripe
operations, or readahead, or something.  filtered through a marketing
person, of course ;(

> Is there a way for me to simulate readahead in userspace, i.e. can
> I do lots of sequential asynchronous reads in parallel?

there is async IO, but I don't think this is going to help you much.

> Also, is there a way to disable caching of reads?  Having to clear

yes: O_DIRECT.

I'm attaching a little program I wrote which basically just shows you
incremental bandwidth.  you can use it to show the zones on a disk
(just "iorate -r /dev/hda -l 9999" and plot the results), or to do
normal r/w bandwidth without being confused by the page-cache.  you
can even use it as a filter to measure tape backup performance.

it doesn't try to do anything with random seeks.  it doesn't do
anything multi-stream.

regards, mark hahn.

[-- Attachment #2: Type: TEXT/PLAIN, Size: 5440 bytes --]

/* iorate.c - measure rates of sequential IO, showing incremental bandwidth
   written by Mark Hahn (hahn@mcmaster.ca) 2003,2004,2005

   the main point of this code is to illustrate the danger of running
   naive bandwidth tests on files that are small relative to the
   memory/disk bandwidth ratio of your system.  that is, on any system,
   the incremental bandwidth will start out huge, since IO is purely to
   the page cache.  once you exceed that size, bandwidth will be
   dominated by the real disk performance.  but using the average of
   these two modes is a mistake, even if you use very large files.
*/
#define _LARGEFILE64_SOURCE 1
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/time.h>
#include <sys/fcntl.h>
#include <sys/stat.h>
#include <stdarg.h>
#include <string.h>
#include <sys/mman.h>

#ifdef O_LARGEFILE
#define LF O_LARGEFILE
#elif defined(_O_LARGEFILE)
#define LF _O_LARGEFILE
#else
#define LF 0
#endif

#ifndef O_DIRECT
#define O_DIRECT 040000
#endif

typedef unsigned long long u64;

u64 bytes = 0, bytesLast = 0;
double timeStart = 0, timeLast = 0;

/* default reporting interval is every 2 seconds; in 2004, an entry-level
   desktop disk will sustain around 50 MB/s, so the default bytes interval
   is 100 MB.  whichever comes first.
*/
u64 byteInterval = 100;
double timeInterval = 2;

double gtod() {
    struct timeval tv;
    gettimeofday(&tv,0);
    return tv.tv_sec + 1e-6 * tv.tv_usec;
}

void dumpstats(int force) {
    u64 db = bytes - bytesLast;
    double now = gtod();
    double dt;
    static int first = 1;

    if (timeLast == 0)
        timeStart = timeLast = now;
    dt = now - timeLast;

    if (!force && db < byteInterval && dt < timeInterval)
        return;

    if (first) {
        printf("#%7s %7s %7s %7s\n", "secs", "MB", "MB/sec", "MB/sec");
        first = 0;
    }
    printf("%7.3f %7.3f %7.3f %7.3f\n",
           now - timeStart,
           1e-6 * bytes,
           1e-6 * db / dt,
           1e-6 * bytes / (now-timeStart));
    timeLast = now;
    bytesLast = bytes;
}

void usage() {
    fprintf(stderr,"iorate [-r/w filename] [-d] [-c chunksz][-b byteivl][-t ivl][-l szlim] [-r in] [-w out]\n");
    fprintf(stderr,"-r in or -w out select which file is read or written ('-' for stdin/out)\n");
    fprintf(stderr,"-c chunksz - size of chunks written (KB);\n");
    fprintf(stderr,"-t timeinterval - collect rate each timeinterval seconds;\n");
    fprintf(stderr,"-b byteinterval - collect rate each byteinterval MB;\n");
    fprintf(stderr,"-l limit - total output size limit (MB);\n");
    fprintf(stderr,"-d use O_DIRECT\n");
    fprintf(stderr,"defaults are: '-c 8 -b 20 -t 10 -l 10'\n");
    exit(1);
}

void fatal(char *format, ...) {
    va_list ap;
    va_start(ap,format);
    vfprintf(stderr,format,ap);
    fprintf(stderr,": errno=%d (%s)\n",errno,strerror(errno));
    va_end(ap);
    dumpstats(1);
    exit(1);
}

/* allocate a buffer using mmap to ensure it's page-aligned.
   O_DIRECT *could* be more strict than that, but probably isn't */
void *myalloc(unsigned size) {
    unsigned s = (size + 4095) & ~4095U;
    void *p = mmap(0, s,
                   PROT_READ|PROT_WRITE,
                   MAP_ANONYMOUS|MAP_PRIVATE,
                   -1, 0);
    if (p == MAP_FAILED)
        return 0;
    return p;
}

int main(int argc, char *argv[]) {
    unsigned size = 8;
    char *buffer;
    u64 limit = 10;
    char *fnameI = 0;
    char *fnameO = 0;
    int fdI = 0;
    int fdO = 1;
    int doRead = 0;
    int doWrite = 0;
    int doDirect = 0;
    int letter;

    while ((letter = getopt(argc,argv,"r:w:b:c:l:t:d")) != -1) {
        switch(letter) {
        case 'r': fnameI = optarg; doRead = 1; break;
        case 'w': fnameO = optarg; doWrite = 1; break;
        case 'b': byteInterval = atoi(optarg); break;
        case 'c': size = atoi(optarg); break;
        case 'l': limit = atoi(optarg); break;
        case 't': timeInterval = atof(optarg); break;
        case 'd': doDirect = 1; break;
        default: usage();
        }
    }
    if (argc != optind)
        usage();

    byteInterval *= 1e6;
    limit *= 1e6;
    size *= 1024;

    setbuf(stdout, 0);

    fprintf(stderr,"chunk %dK, byteInterval %uM, timeInterval %f, limit %uM\n",
            size>>10,
            (unsigned)(byteInterval>>20),
            timeInterval,
            (unsigned)(limit>>20));

    if (doRead && fnameI && strcmp(fnameI,"-")) {
        fdI = open(fnameI, O_RDONLY | LF);
        if (fdI == -1) fatal("open(read) failed");
    }
    if (doWrite && fnameO && strcmp(fnameO,"-")) {
        int flags = O_RDWR | O_CREAT | LF;
        if (doDirect) flags |= O_DIRECT;
        fdO = open(fnameO, flags, 0600);
        if (fdO == -1) fatal("open(write) failed");
    }

    buffer = myalloc(size);
    memset(buffer,'m',size);

    timeStart = timeLast = gtod();
    bytes = 0;

    while (bytes < limit) {
        int c = size;
        dumpstats(0);
        if (doRead) {
            c = read(fdI,buffer,c);
            if (c == -1) fatal("read failed");
        }
        if (doWrite) {
            c = write(fdO,buffer,c);
            if (c == -1) fatal("write failed");
        }
        bytes += c;
        /* short read/write means EOF. */
        if (c < size)
            break;
    }
    dumpstats(1);
    return 0;
}
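
Going by the usage() text and defaults in the source above, the program could
be built and pointed at the raid device roughly like this; the invocation is
illustrative, and note that in this version O_DIRECT is only applied on the
write path:

# Build and run Mark's iorate against the raid device: 64KB chunks,
# 300MB limit, incremental bandwidth printed as it goes.
# (Needs read access to the device, i.e. root.)
gcc -O2 -o iorate iorate.c
./iorate -r /dev/md2 -c 64 -l 300
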

* Re: RAID-5 streaming read performance
From: Dan Christensen @ 2005-07-14 21:16 UTC (permalink / raw)
To: linux-raid

Mark Hahn <hahn@physics.mcmaster.ca> writes:

>> Is there a way for me to simulate readahead in userspace, i.e. can
>> I do lots of sequential asynchronous reads in parallel?
>
> there is async IO, but I don't think this is going to help you much.
>
>> Also, is there a way to disable caching of reads?  Having to clear
>
> yes: O_DIRECT.

That might disable caching of reads, but it also disables readahead,
so unless I manually use aio to simulate readahead, this isn't going
to solve my problem, which is having to clear the cache before each
test to get relevant results.

I'm really surprised there isn't something in /proc you can use to
clear or disable the cache.  Would be very useful for benchmarking!

Dan

* Re: RAID-5 streaming read performance
From: Ming Zhang @ 2005-07-14 21:30 UTC (permalink / raw)
To: Dan Christensen; +Cc: Linux RAID

i also want a way to clear part of the whole page cache by file id. :)

i also want a way to tell the cache distribution - how many pages for
file A and B, ....

ming

On Thu, 2005-07-14 at 17:16 -0400, Dan Christensen wrote:
> I'm really surprised there isn't something in /proc you can use to
> clear or disable the cache.  Would be very useful for benchmarking!

* Re: RAID-5 streaming read performance
From: Mark Hahn @ 2005-07-14 23:29 UTC (permalink / raw)
To: Ming Zhang; +Cc: Dan Christensen, Linux RAID

> i also want a way to clear part of the whole page cache by file id. :)

understandably, kernel developers don't give high priority to this
sort of not-useful-for-normal-work feature.

> i also want a way to tell the cache distribution - how many pages for
> file A and B, ....

you should probably try mmaping the file and using mincore.
come to think of it, mmap+madvise might be a sensible way to
flush pages corresponding to a particular file, as well.

>> I'm really surprised there isn't something in /proc you can use to
>> clear or disable the cache.  Would be very useful for benchmarking!

I assume you noticed "blockdev --flushbufs", no?  it works for me
(ie, a small, repeated streaming read of a disk device will show
pagecache speed if you don't flush between runs).

I think the problem is that it's difficult to dissociate readahead,
writebehind and normal lru-ish caching.  there was quite a flurry of
activity around 2.4.10 related to this, and it left a bad taste in
everyone's mouth.  I think the main conclusion was that too much
fanciness results in a fragile, more subtle and difficult-to-maintain
system that performs better, true, but over a narrower range of
workloads.

regards, mark hahn
sharcnet/mcmaster
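
That suggests a simpler variant of Dan's test script from earlier in the
thread, dropping the 900MB read of an unrelated partition in favour of an
explicit flush (a sketch, device names as in his script):

# Flush each block device's cached pages before the timed read.
for f in sda7 sdb5 sdc5 sdd5 md2 ; do
    blockdev --flushbufs /dev/$f
    echo $f
    dd if=/dev/$f of=/dev/null bs=1M count=300 2>&1 | grep bytes/sec
    echo
done
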

* Re: RAID-5 streaming read performance
From: Ming Zhang @ 2005-07-15  1:23 UTC (permalink / raw)
To: Mark Hahn; +Cc: Dan Christensen, Linux RAID

On Thu, 2005-07-14 at 19:29 -0400, Mark Hahn wrote:
> > i also want a way to clear part of the whole page cache by file id. :)
>
> understandably, kernel developers don't give high priority to this
> sort of not-useful-for-normal-work feature.

agree.

> you should probably try mmaping the file and using mincore.
> come to think of it, mmap+madvise might be a sensible way to
> flush pages corresponding to a particular file, as well.

i prefer a generic way. :) it will be useful for tuning the system.
maybe a program that iterates over the kernel structures could do
this.

> I assume you noticed "blockdev --flushbufs", no?  it works for me
> (ie, a small, repeated streaming read of a disk device will show
> pagecache speed if you don't flush between runs).

it will do a flush, right? but will it flush and also drop the cache?

> I think the problem is that it's difficult to dissociate readahead,
> writebehind and normal lru-ish caching.  there was quite a flurry of
> activity around 2.4.10 related to this, and it left a bad taste in
> everyone's mouth.  I think the main conclusion was that too much
> fanciness results in a fragile, more subtle and difficult-to-maintain
> system that performs better, true, but over a narrower range of
> workloads.

maybe this will happen again for 2.6.x? i think there are still many
gray areas that can be checked, and many places that can be improved.

a test i did shows that even if you have sda and sdb forming a raid0,
the page cache for sda and sdb will not be used by raid0. kind of
funny.

thx!

Ming
* Re: RAID-5 streaming read performance
2005-07-15  1:23 ` Ming Zhang
@ 2005-07-15  2:11 ` Dan Christensen
2005-07-15 12:28 ` Ming Zhang
0 siblings, 1 reply; 41+ messages in thread
From: Dan Christensen @ 2005-07-15 2:11 UTC (permalink / raw)
To: linux-raid

Ming Zhang <mingz@ele.uri.edu> writes:

> On Thu, 2005-07-14 at 19:29 -0400, Mark Hahn wrote:
>>
>> > i also want a way to clear part of the whole page cache by file id. :)
>>
>> understandably, kernel developers don't give high priority to this sort of
>> not-useful-for-normal-work feature.
> agree.

Clearing just part of the page cache sounds too complicated to be
worth it, but clearing it all seems reasonable; some kernel developers
spend time doing benchmarks too!

>> > Dan Christensen wrote:
>> >
>> > > I'm really surprised there isn't something in /proc you can use to
>> > > clear or disable the cache.  Would be very useful for benchmarking!
>>
>> I assume you noticed "blockdev --flushbufs", no?  it works for me

I had tried this and noticed that it didn't work for files on a
filesystem.  But it does seem to work for block devices.  That's
great, thanks.  I didn't realize the cache was so complicated;
it can be retained for files but not for the block device underlying
those files!

> a test i did shows that even if you have sda and sdb forming a raid0,
> the page cache for sda and sdb will not be used by raid0. kind of
> funny.

I thought I had noticed raid devices making use of cache from
underlying devices, but a test I just did agrees with your result, for
both RAID-1 and RAID-5.  Again, this seems odd.  Shouldn't the raid
layer take advantage of a block that's already in RAM?  I guess this
won't matter in practice, since you usually don't read from both a
raid device and an underlying device.

Dan

^ permalink raw reply	[flat|nested] 41+ messages in thread
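For reference, blockdev --flushbufs is essentially a thin wrapper around the
BLKFLSBUF ioctl, which writes out any dirty buffers and then invalidates the
device's cached pages. A rough C equivalent (needs root, device path from the
command line) is sketched below; it is illustrative only.

/* flushdev.c - roughly what "blockdev --flushbufs" does: flush and
 * invalidate the cached data for a block device via BLKFLSBUF. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* BLKFLSBUF */

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s /dev/xxx\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* requires CAP_SYS_ADMIN; drops the page cache for this device only */
    if (ioctl(fd, BLKFLSBUF, 0) < 0) { perror("BLKFLSBUF"); return 1; }

    close(fd);
    return 0;
}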
* Re: RAID-5 streaming read performance
2005-07-15  2:11 ` Dan Christensen
@ 2005-07-15 12:28 ` Ming Zhang
0 siblings, 0 replies; 41+ messages in thread
From: Ming Zhang @ 2005-07-15 12:28 UTC (permalink / raw)
To: Dan Christensen; +Cc: Linux RAID

On Thu, 2005-07-14 at 22:11 -0400, Dan Christensen wrote:
> Ming Zhang <mingz@ele.uri.edu> writes:
>
> > On Thu, 2005-07-14 at 19:29 -0400, Mark Hahn wrote:
> >>
> >> > i also want a way to clear part of the whole page cache by file id. :)
> >>
> >> understandably, kernel developers don't give high priority to this sort of
> >> not-useful-for-normal-work feature.
> > agree.
>
> Clearing just part of the page cache sounds too complicated to be
> worth it, but clearing it all seems reasonable; some kernel developers
> spend time doing benchmarks too!

maybe they do not care to run a program to clear it every time. :P

> >> > Dan Christensen wrote:
> >> >
> >> > > I'm really surprised there isn't something in /proc you can use to
> >> > > clear or disable the cache.  Would be very useful for benchmarking!
> >>
> >> I assume you noticed "blockdev --flushbufs", no?  it works for me
>
> I had tried this and noticed that it didn't work for files on a
> filesystem.  But it does seem to work for block devices.  That's
> great, thanks.  I didn't realize the cache was so complicated;
> it can be retained for files but not for the block device underlying
> those files!

yes, that is why the command is named blockdev. :) i guess for files we
just need to call the fsync system call? does that call work on a block
device as well?

> > a test i did shows that even if you have sda and sdb forming a raid0,
> > the page cache for sda and sdb will not be used by raid0. kind of
> > funny.
>
> I thought I had noticed raid devices making use of cache from
> underlying devices, but a test I just did agrees with your result, for
> both RAID-1 and RAID-5.  Again, this seems odd.  Shouldn't the raid
> layer take advantage of a block that's already in RAM?  I guess this
> won't matter in practice, since you usually don't read from both a
> raid device and an underlying device.

you are right, that is weird in the real world.

ming

^ permalink raw reply	[flat|nested] 41+ messages in thread
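An aside on the fsync() question: fsync() only writes dirty pages back to
disk, it does not drop clean pages from the cache. For regular files, one
option that may work — assuming a kernel and glibc recent enough to implement
posix_fadvise(), which should be true for 2.6 — is POSIX_FADV_DONTNEED. This
is a rough sketch of that approach, not something tested in the thread.

/* dropcache.c - try to drop a regular file's pages from the page
 * cache using posix_fadvise(POSIX_FADV_DONTNEED).  fsync() is called
 * first because dirty pages are not discarded. */
#define _XOPEN_SOURCE 600
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    fsync(fd);   /* write back any dirty pages first */

    int err = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    if (err)
        fprintf(stderr, "posix_fadvise: %s\n", strerror(err));

    close(fd);
    return 0;
}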
* Re: RAID-5 streaming read performance
2005-07-14  3:58 ` Dan Christensen
2005-07-14  4:13 ` Mark Hahn
@ 2005-07-14 12:30 ` Ming Zhang
2005-07-14 14:23 ` Ming Zhang
2005-07-14 17:54 ` Dan Christensen
2005-07-15  2:38 ` Dan Christensen
2 siblings, 2 replies; 41+ messages in thread
From: Ming Zhang @ 2005-07-14 12:30 UTC (permalink / raw)
To: Dan Christensen; +Cc: David Greaves, Linux RAID

On Wed, 2005-07-13 at 23:58 -0400, Dan Christensen wrote:
> David Greaves <david@dgreaves.com> writes:
>
> > In my setup I get
> >
> > component partitions, e.g. /dev/sda7:  39MB/s
> > raid device /dev/md2:                  31MB/s
> > lvm device /dev/main/media:            53MB/s
> >
> > (oldish system - but note that lvm device is *much* faster)
>
> Did you test component device and raid device speed using the
> read-ahead settings tuned for lvm reads?  If so, that's not a fair
> comparison.  :-)
>
> > For your entertainment you may like to try this to 'tune' your readahead
> > - it's OK to use so long as you're not recording:
>
> Thanks, I played around with that a lot.  I tuned readahead to
> optimize lvm device reads, and this improved things greatly.  It turns
> out the default lvm settings had readahead set to 0!  But by tuning
> things, I could get my read speed up to 59MB/s.  This is with raw
> device readahead 256, md device readahead 1024 and lvm readahead 2048.
> (The speed was most sensitive to the last one, but did seem to depend
> on the other ones a bit too.)
>
> I separately tuned the raid device read speed.  To maximize this, I
> needed to set the raw device readahead to 1024 and the raid device
> readahead to 4096.  This brought my raid read speed from 59MB/s to
> 78MB/s.  Better!  (But note that now this makes the lvm read speed
> look bad.)
>
> My raw device read speed is independent of the readahead setting,
> as long as it is at least 256.  The speed is about 58MB/s.
>
> Summary:
>
> raw device:   58MB/s
> raid device:  78MB/s
> lvm device:   59MB/s
>
> raid still isn't achieving the 106MB/s that I can get with parallel
> direct reads, but at least it's getting closer.
>
> As a simple test, I wrote a program like dd that reads and discards
> 64k chunks of data from a device, but which skips 1 out of every four
> chunks (simulating skipping parity blocks).  It's not surprising that
> this program can only read from a raw device at about 75% the rate of
> dd, since the kernel readahead is probably causing the skipped blocks
> to be read anyways (or maybe because the disk head has to pass over
> those sections of the disk anyways).
>
> I then ran four copies of this program in parallel, reading from the
> raw devices that make up my raid partition.  And, like md, they only
> achieved about 78MB/s.  This is very close to 75% of 106MB/s.  Again,
> not surprising, since I need to have raw device readahead turned on
> for this to be efficient at all, so 25% of the chunks that pass
> through the controller are ignored.
>
> But I still don't understand why the md layer can't do better.  If I
> turn off readahead of the raw devices, and keep it for the raid
> device, then parity blocks should never be requested, so they
> shouldn't use any bus/controller bandwidth.  And even if each drive is
> only acting at 75% efficiency, the four drives should still be able to
> saturate the bus/controller.  So I can't figure out what's going on
> here.

when reading, i do not think MD will read the parity at all. but since
the parity is spread over all the disks, there might be a seek here. so
you might want to try a raid4 to see what happens as well.

> Is there a way for me to simulate readahead in userspace, i.e. can
> I do lots of sequential asynchronous reads in parallel?
>
> Also, is there a way to disable caching of reads?  Having to clear
> the cache by reading 900M each time slows down testing.  I guess
> I could reboot with mem=100M, but it'd be nice to disable/enable
> caching on the fly.  Hmm, maybe I can just run something like
> memtest which locks a bunch of ram...

after you run your code, check /proc/meminfo; the cached value might be
much lower than u expected. my feeling is that the linux page cache will
discard all of its cache once the last file handle is closed.

> Thanks for all of the help so far!
>
> Dan

^ permalink raw reply	[flat|nested] 41+ messages in thread
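The chunk-skipping reader itself is not posted in the thread; a rough
reconstruction of the idea described above — read a device in 64k chunks,
throw the data away, and seek past every fourth chunk to mimic skipping
parity — might look like the following. The chunk size and skip pattern come
from the description; everything else is guesswork.

/* skipread.c - read a device in 64k chunks, discarding the data and
 * skipping every fourth chunk (as raid5 skips parity on reads). */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

#define CHUNK (64 * 1024)

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s device chunks\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    long total = atol(argv[2]);
    char *buf = malloc(CHUNK);
    if (!buf) { perror("malloc"); return 1; }

    for (long i = 0; i < total; i++) {
        if (i % 4 == 3) {
            /* pretend this chunk is parity: seek past it instead of reading */
            if (lseek(fd, CHUNK, SEEK_CUR) < 0) break;
            continue;
        }
        ssize_t n = read(fd, buf, CHUNK);
        if (n <= 0)
            break;      /* EOF or error */
    }

    free(buf);
    close(fd);
    return 0;
}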
* Re: RAID-5 streaming read performance
2005-07-14 12:30 ` Ming Zhang
@ 2005-07-14 14:23 ` Ming Zhang
2005-07-14 17:54 ` Dan Christensen
1 sibling, 0 replies; 41+ messages in thread
From: Ming Zhang @ 2005-07-14 14:23 UTC (permalink / raw)
To: Dan Christensen; +Cc: David Greaves, Linux RAID

my mistake here. this only applies to sdX, not mdX. pls ignore this.

ming

On Thu, 2005-07-14 at 08:30 -0400, Ming Zhang wrote:
> > Also, is there a way to disable caching of reads?  Having to clear
> > the cache by reading 900M each time slows down testing.  I guess
> > I could reboot with mem=100M, but it'd be nice to disable/enable
> > caching on the fly.  Hmm, maybe I can just run something like
> > memtest which locks a bunch of ram...
>
> after you run your code, check /proc/meminfo; the cached value might be
> much lower than u expected. my feeling is that the linux page cache will
> discard all of its cache once the last file handle is closed.
>

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: RAID-5 streaming read performance
2005-07-14 12:30 ` Ming Zhang
2005-07-14 14:23 ` Ming Zhang
@ 2005-07-14 17:54 ` Dan Christensen
2005-07-14 18:00 ` Ming Zhang
1 sibling, 1 reply; 41+ messages in thread
From: Dan Christensen @ 2005-07-14 17:54 UTC (permalink / raw)
To: linux-raid

Ming Zhang <mingz@ele.uri.edu> writes:

> On Wed, 2005-07-13 at 23:58 -0400, Dan Christensen wrote:
>
>> But I still don't understand why the md layer can't do better.  If I
>> turn off readahead of the raw devices, and keep it for the raid
>> device, then parity blocks should never be requested, so they
>> shouldn't use any bus/controller bandwidth.  And even if each drive is
>> only acting at 75% efficiency, the four drives should still be able to
>> saturate the bus/controller.  So I can't figure out what's going on
>> here.
>
> when reading, i do not think MD will read the parity at all. but since
> the parity is spread over all the disks, there might be a seek here.

Yes, there will be a seek, or internal drive readahead, so each drive
will operate at around 75% efficiency.  But since that shouldn't
affect bus/controller traffic, I still would expect to get over
100MB/s with my hardware.

>> Also, is there a way to disable caching of reads?
>
> after you run your code, check /proc/meminfo; the cached value might be
> much lower than u expected. my feeling is that the linux page cache will
> discard all of its cache once the last file handle is closed.

Ming Zhang <mingz@ele.uri.edu> writes:

> my mistake here. this only applies to sdX, not mdX. pls ignore this.

I'm not sure what you mean.  For reads from sdX, mdX, files on sdX
or files on mdX, the cache is retained.  So it's necessary to clear
this cache to get valid timing results.

Dan

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: RAID-5 streaming read performance
2005-07-14 17:54 ` Dan Christensen
@ 2005-07-14 18:00 ` Ming Zhang
2005-07-14 18:03 ` Dan Christensen
0 siblings, 1 reply; 41+ messages in thread
From: Ming Zhang @ 2005-07-14 18:00 UTC (permalink / raw)
To: Dan Christensen; +Cc: Linux RAID

On Thu, 2005-07-14 at 13:54 -0400, Dan Christensen wrote:
> Ming Zhang <mingz@ele.uri.edu> writes:
>
> > On Wed, 2005-07-13 at 23:58 -0400, Dan Christensen wrote:
> >
> >> But I still don't understand why the md layer can't do better.  If I
> >> turn off readahead of the raw devices, and keep it for the raid
> >> device, then parity blocks should never be requested, so they
> >> shouldn't use any bus/controller bandwidth.  And even if each drive is
> >> only acting at 75% efficiency, the four drives should still be able to
> >> saturate the bus/controller.  So I can't figure out what's going on
> >> here.
> >
> > when reading, i do not think MD will read the parity at all. but since
> > the parity is spread over all the disks, there might be a seek here.
>
> Yes, there will be a seek, or internal drive readahead, so each drive
> will operate at around 75% efficiency.  But since that shouldn't
> affect bus/controller traffic, I still would expect to get over
> 100MB/s with my hardware.

agree. but what if your controller is a bottleneck? u need to have
another card to find out.

> >> Also, is there a way to disable caching of reads?
> >
> > after you run your code, check /proc/meminfo; the cached value might be
> > much lower than u expected. my feeling is that the linux page cache will
> > discard all of its cache once the last file handle is closed.
>
> Ming Zhang <mingz@ele.uri.edu> writes:
>
> > my mistake here. this only applies to sdX, not mdX. pls ignore this.
>
> I'm not sure what you mean.  For reads from sdX, mdX, files on sdX
> or files on mdX, the cache is retained.  So it's necessary to clear
> this cache to get valid timing results.

yes, i was insane at that time, pls ignore that blah blah.

> Dan
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: RAID-5 streaming read performance
2005-07-14 18:00 ` Ming Zhang
@ 2005-07-14 18:03 ` Dan Christensen
2005-07-14 18:10 ` Ming Zhang
0 siblings, 1 reply; 41+ messages in thread
From: Dan Christensen @ 2005-07-14 18:03 UTC (permalink / raw)
To: mingz; +Cc: Linux RAID

[Ming, could you trim quoted material down a bit more, and leave a
blank line between quoted material and your new text?  Thanks.]

Ming Zhang <mingz@ele.uri.edu> writes:

> On Thu, 2005-07-14 at 13:54 -0400, Dan Christensen wrote:
>>
>> Yes, there will be a seek, or internal drive readahead, so each drive
>> will operate at around 75% efficiency.  But since that shouldn't
>> affect bus/controller traffic, I still would expect to get over
>> 100MB/s with my hardware.
>
> agree. but what if your controller is a bottleneck? u need to have
> another card to find out.

The controller and/or bus *is* the bottleneck, but I've already shown
that I can get 106MB/s through them.

Dan

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: RAID-5 streaming read performance
2005-07-14 18:03 ` Dan Christensen
@ 2005-07-14 18:10 ` Ming Zhang
2005-07-14 19:16 ` Dan Christensen
0 siblings, 1 reply; 41+ messages in thread
From: Ming Zhang @ 2005-07-14 18:10 UTC (permalink / raw)
To: Dan Christensen; +Cc: Linux RAID

On Thu, 2005-07-14 at 14:03 -0400, Dan Christensen wrote:
> [Ming, could you trim quoted material down a bit more, and leave a
> blank line between quoted material and your new text?  Thanks.]

thanks. sorry about that.

> Ming Zhang <mingz@ele.uri.edu> writes:
>
> > On Thu, 2005-07-14 at 13:54 -0400, Dan Christensen wrote:
> >>
> > agree. but what if your controller is a bottleneck? u need to have
> > another card to find out.
>
> The controller and/or bus *is* the bottleneck, but I've already shown
> that I can get 106MB/s through them.
>
> Dan

then can u test RAID0 a bit? That is easier to analyze.

Ming

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: RAID-5 streaming read performance
2005-07-14 18:10 ` Ming Zhang
@ 2005-07-14 19:16 ` Dan Christensen
2005-07-14 20:13 ` Ming Zhang
0 siblings, 1 reply; 41+ messages in thread
From: Dan Christensen @ 2005-07-14 19:16 UTC (permalink / raw)
To: linux-raid

Ming Zhang <mingz@ele.uri.edu> writes:

> then can u test RAID0 a bit? That is easier to analyze.

I can't easily test RAID-0 with my set-up, but I can test RAID-1 with
two partitions.  I found that the read speed from the md device was
about the same as the read speed from each partition.  This was with
readahead set to 4096 on the md device, so I had hoped that it would
do better.  Based on the output of iostat, it looks like the reads
were shared roughly equally between the two partitions (53%/47%).

Does the RAID-1 code try to take the first stripe from disk 1, the
second from disk 2, alternately?  Or is it clever enough to try to
take the first dozen from disk 1, the next dozen from disk 2, etc,
in order to get larger, contiguous reads?

It's less clear to me that RAID-1 with two drives will be able to
overcome the overhead of skipping various blocks.  But it seems like
RAID-5 with four drives should be able to saturate my bus/controller.
For example, RAID-5 could just do sequential reads from 3 of the 4
drives, and use the parity chunks it reads to reconstruct the data
chunks from the fourth drive.  If I do parallel reads from 3 of my 4
disks, I can still get 106MB/s.

Dan

PS: Here's my simple test script, cleaned up a bit:

#!/bin/sh

# Devices to test for speed, and megabytes to read.
MDDEV=/dev/md2
MDMB=300
RAWDEVS="/dev/sda7 /dev/sdb5 /dev/sdc5 /dev/sdd5"
RAWMB=300

# Device to read to clear cache, and amount in megabytes.
CACHEDEV=/dev/sda8
CACHEMB=900

clearcache () {
    echo "Clearing cache..."
    dd if=$CACHEDEV of=/dev/null bs=1M count=$CACHEMB > /dev/null 2>&1
}

testdev () {
    echo "Read test from $1..."
    dd if=$1 of=/dev/null bs=1M count=$2 2>&1 | grep bytes/sec
    echo
}

clearcache
for f in $RAWDEVS ; do
    testdev $f $RAWMB
done

clearcache
testdev $MDDEV $MDMB

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: RAID-5 streaming read performance
2005-07-14 19:16 ` Dan Christensen
@ 2005-07-14 20:13 ` Ming Zhang
0 siblings, 0 replies; 41+ messages in thread
From: Ming Zhang @ 2005-07-14 20:13 UTC (permalink / raw)
To: Dan Christensen; +Cc: Linux RAID

raid5 can not be that smart. :P

ming

On Thu, 2005-07-14 at 15:16 -0400, Dan Christensen wrote:
> bus/controller.  For example, RAID-5 could just do sequential
> reads from 3 of the 4 drives, and use the parity chunks it
> reads to reconstruct the data chunks from the fourth drive.
> If I do parallel reads from 3 of my 4 disks, I can still get
> 106MB/s.
>

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: RAID-5 streaming read performance
2005-07-14  3:58 ` Dan Christensen
2005-07-14  4:13 ` Mark Hahn
2005-07-14 12:30 ` Ming Zhang
@ 2005-07-15  2:38 ` Dan Christensen
2005-07-15  6:01 ` Holger Kiehl
2 siblings, 1 reply; 41+ messages in thread
From: Dan Christensen @ 2005-07-15 2:38 UTC (permalink / raw)
To: linux-raid

Summary so far:

RAID-5, four SATA hard drives, 2.6.12.2 kernel.  Testing streaming
read speed.  With readahead optimized, I get:

each raw device:          58MB/s
raid device:              78MB/s
3 or 4 parallel reads
from the raw devices:    106MB/s

I'm trying to figure out why the last two numbers differ.

I was afraid that for some reason the kernel was requesting the parity
blocks instead of just the data blocks, but by using iostat it's
pretty clear that the right number of blocks are being requested from
the raw devices.  If I write a dumb program that reads 3 out of every
4 64k chunks of a raw device, the kernel readahead kicks in and the
chunks I skip over do contribute to the iostat numbers.  But the raid
layer is correctly avoiding this readahead.

One other theory at this point is that my controller is trying to be
clever and doing some readahead itself.  Even if this is the case, I'd
be surprised if this would cause a problem, since the data won't have
to go over the bus.  But maybe the controller is doing this and is
causing itself to become overloaded?  My controller is a Silicon Image
3114.  Details at the end, for the record.

Second theory: for contiguous streams from the raw devices, the reads
are done in really big chunks.  But for md layer reads, the biggest
possible chunk is 3 x 64k, if you want to skip parity blocks.  Could
3 x 64k be small enough to cause overhead?  Seems unlikely.

Those are my only guesses.  Any others?  It seems strange that I can
beat the md layer in userspace by 33%, by just reading from three of
the devices and using parity to reconstruct the fourth!

Thanks again for all the help.  I've learned a lot!  And I haven't
even started working on write speed...

Dan

0000:01:0b.0 RAID bus controller: Silicon Image, Inc. (formerly CMD Technology Inc) SiI 3114 [SATALink/SATARaid] Serial ATA Controller (rev 02)
	Subsystem: Silicon Image, Inc. (formerly CMD Technology Inc): Unknown device 6114
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
	Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
	Latency: 32, Cache Line Size: 0x08 (32 bytes)
	Interrupt: pin A routed to IRQ 177
	Region 0: I/O ports at 9400 [size=8]
	Region 1: I/O ports at 9800 [size=4]
	Region 2: I/O ports at 9c00 [size=8]
	Region 3: I/O ports at a000 [size=4]
	Region 4: I/O ports at a400 [size=16]
	Region 5: Memory at e1001000 (32-bit, non-prefetchable) [size=1K]
	Capabilities: [60] Power Management version 2
		Flags: PMEClk- DSI+ D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 PME-Enable- DSel=0 DScale=2 PME-

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: RAID-5 streaming read performance
2005-07-15  2:38 ` Dan Christensen
@ 2005-07-15  6:01 ` Holger Kiehl
2005-07-15 12:29 ` Ming Zhang
0 siblings, 1 reply; 41+ messages in thread
From: Holger Kiehl @ 2005-07-15 6:01 UTC (permalink / raw)
To: Dan Christensen; +Cc: linux-raid

Hello

On Thu, 14 Jul 2005, Dan Christensen wrote:

> Summary so far:
>
> RAID-5, four SATA hard drives, 2.6.12.2 kernel.  Testing streaming
> read speed.  With readahead optimized, I get:
>
> each raw device:          58MB/s
> raid device:              78MB/s
> 3 or 4 parallel reads
> from the raw devices:    106MB/s
>
> I'm trying to figure out why the last two numbers differ.
>
Have you checked what the performance with a 2.4.x kernel is?  If I
remember correctly there was some discussion on this list that 2.4
raid5 has better read performance.

Holger

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: RAID-5 streaming read performance
2005-07-15  6:01 ` Holger Kiehl
@ 2005-07-15 12:29 ` Ming Zhang
0 siblings, 0 replies; 41+ messages in thread
From: Ming Zhang @ 2005-07-15 12:29 UTC (permalink / raw)
To: Holger Kiehl; +Cc: Dan Christensen, Linux RAID

in my previous tests with SATA, i got better results with 2.6 than
with 2.4. :P

Ming

On Fri, 2005-07-15 at 06:01 +0000, Holger Kiehl wrote:
> > I'm trying to figure out why the last two numbers differ.
> >
> Have you checked what the performance with a 2.4.x kernel is?  If I
> remember correctly there was some discussion on this list that 2.4
> raid5 has better read performance.
>
> Holger
>

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: RAID-5 streaming read performance
2005-07-13 12:48 ` Dan Christensen
2005-07-13 12:52 ` Ming Zhang
@ 2005-07-13 22:42 ` Neil Brown
1 sibling, 0 replies; 41+ messages in thread
From: Neil Brown @ 2005-07-13 22:42 UTC (permalink / raw)
To: Dan Christensen; +Cc: mingz, Linux RAID

On Wednesday July 13, jdc@uwo.ca wrote:
> Question for the list: if I'm doing a long sequential write, naively
> each parity block will get recalculated and rewritten several times,
> once for each non-parity block in the stripe.  Does the write-caching
> that the kernel does mean that each parity block will only get written
> once?

Raid5 does the best it can.  It delays write requests as long as
possible, and then when it must do the write, it also writes all the
other blocks in the stripe that it has been asked to write, so only
one parity update is needed for all those blocks.

My tests suggest that for long sequential writes (without syncs) this
achieves full-stripe writes most of the time.

NeilBrown

^ permalink raw reply	[flat|nested] 41+ messages in thread
end of thread, other threads:[~2005-07-15 12:29 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-07-11 15:11 RAID-5 streaming read performance Dan Christensen
2005-07-13  2:08 ` Ming Zhang
2005-07-13  2:52 ` Dan Christensen
2005-07-13  3:15 ` berk walker
2005-07-13 12:24 ` Ming Zhang
2005-07-13 12:48 ` Dan Christensen
2005-07-13 12:52 ` Ming Zhang
2005-07-13 14:23 ` Dan Christensen
2005-07-13 14:29 ` Ming Zhang
2005-07-13 17:56 ` Dan Christensen
2005-07-13 22:38 ` Neil Brown
2005-07-14  0:09 ` Ming Zhang
2005-07-14  1:16 ` Neil Brown
2005-07-14  1:25 ` Ming Zhang
2005-07-13 18:02 ` David Greaves
2005-07-13 18:14 ` Ming Zhang
2005-07-13 21:18 ` David Greaves
2005-07-13 21:44 ` Ming Zhang
2005-07-13 21:50 ` David Greaves
2005-07-13 21:55 ` Ming Zhang
2005-07-13 22:52 ` Neil Brown
2005-07-14  3:58 ` Dan Christensen
2005-07-14  4:13 ` Mark Hahn
2005-07-14 21:16 ` Dan Christensen
2005-07-14 21:30 ` Ming Zhang
2005-07-14 23:29 ` Mark Hahn
2005-07-15  1:23 ` Ming Zhang
2005-07-15  2:11 ` Dan Christensen
2005-07-15 12:28 ` Ming Zhang
2005-07-14 12:30 ` Ming Zhang
2005-07-14 14:23 ` Ming Zhang
2005-07-14 17:54 ` Dan Christensen
2005-07-14 18:00 ` Ming Zhang
2005-07-14 18:03 ` Dan Christensen
2005-07-14 18:10 ` Ming Zhang
2005-07-14 19:16 ` Dan Christensen
2005-07-14 20:13 ` Ming Zhang
2005-07-15  2:38 ` Dan Christensen
2005-07-15  6:01 ` Holger Kiehl
2005-07-15 12:29 ` Ming Zhang
2005-07-13 22:42 ` Neil Brown