linux-raid.vger.kernel.org archive mirror
* Odd (slow) RAID performance
@ 2006-11-30 14:13 Bill Davidsen
  2006-11-30 14:31 ` Roger Lucas
  0 siblings, 1 reply; 20+ messages in thread
From: Bill Davidsen @ 2006-11-30 14:13 UTC (permalink / raw)
  To: linux-raid

Pardon if you see this twice, I sent it last night and it never showed up...

I was seeing some bad disk performance on a new install of Fedora Core 
6, so I did some measurements of write speed, and it would appear that 
write performance is so slow it can't write my data as fast as it is 
generated :-(

The method: I wrote 2GB of data to various configurations with

	sync; time bash -c "dd if=/dev/zero bs=1024k count=2048 of=XXXXX; sync"

where XXXXX was a raw partition, raw RAID device, or ext2 filesystem 
over a RAID device. I recorded the time reported by dd, which doesn't 
include a final sync, and total time from start of write to end of sync, 
which I believe represents the true effective performance. All tests 
were run on a dedicated system, with the RAID devices or filesystem 
freshly created.
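
The effective rate is just the 2GB divided by the total wall-clock time of
the write plus the final sync; for example (numbers illustrative), a run
whose dd-plus-sync total is 55 seconds works out to roughly

	2048 MB / 55 s  ~=  37 MB/s

however fast dd alone claims to have been.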

For a baseline, I wrote to a single drive, single raw partition, which 
gave about 50MB/s transfer. Then I created a RAID-0 device, striped over 
three test drives. As expected this gave a speed of about 147 MB/s. Then 
I created an ext2 filesystem over that device, and the test showed 139 
MB/s speed. This was as expected.

Then I stopped and deleted the RAID-0 and built a RAID-5 on the same 
partitions. A write to this raw RAID device showed only 37.5 MB/s!! 
Putting an ext2 f/s over that device dropped the speed to 35 MB/s. Since 
I am trying to write bursts at 60MB/s, this is a serious problem for me.

Then I created a RAID-10 array on the same partitions. This showed a 
write speed of 75.8 MB/s, double the RAID-5 speed even though I was 
(presumably) writing twice the data. An ext2 f/s on that array showed 
74 MB/s write speed. I didn't use /proc/diskstats to gather 
actual counts, nor do I know if they show actual transfer data below all 
the levels of o/s magic, but that sounds as if RAID-5 is not working 
right. I don't have enough space to use RAID-10 for incoming data, so 
that's not an option.

Then I thought that perhaps my chunk size, defaulted to 64k, was too 
small. So I created an array with a 256k chunk size. That showed about 
36 MB/s to the raw array, and 32.4 MB/s to an ext2 f/s using the array. 
Finally I decided to create a new f/s using the "stride=" option, and 
see if that would work better. I had 256k chunks, two data and one 
parity per stripe, so I used the data size, 512k, for the calculation. 
The man page says to use the f/s block size, 4k in this case, so 512/4 
gave a stride of 128, and I used that. The increase was below the 
noise, about 50KB/s faster.
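
(The mkfs invocation was along these lines, from memory, so treat it as a 
sketch; the device name is just an example:

	mke2fs -b 4096 -E stride=128 /dev/md0

with -b giving the f/s block size and stride counted in f/s blocks. Older 
e2fsprogs spells the same thing -R stride=128.)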

Any thoughts on this are gratefully accepted. I may try the motherboard 
RAID if I can't make this work, and it certainly explains why my 
swapping is so slow. That I can switch to RAID-1, since it's used mainly 
for test, big data sets, and suspend. If I can't make this fast I'd like 
to understand why it's slow.

I did make the raw results available if people want to see more info:
http://www.tmr.com/~davidsen/RAID_speed.html

-- 
Bill Davidsen <davidsen@tmr.com>
  "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot



* RE: Odd (slow) RAID performance
  2006-11-30 14:13 Odd (slow) RAID performance Bill Davidsen
@ 2006-11-30 14:31 ` Roger Lucas
  2006-11-30 15:30   ` Bill Davidsen
  0 siblings, 1 reply; 20+ messages in thread
From: Roger Lucas @ 2006-11-30 14:31 UTC (permalink / raw)
  To: 'Bill Davidsen', linux-raid

> -----Original Message-----
> From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
> owner@vger.kernel.org] On Behalf Of Bill Davidsen
> Sent: 30 November 2006 14:13
> To: linux-raid@vger.kernel.org
> Subject: Odd (slow) RAID performance
> 
> Pardon if you see this twice, I sent it last night and it never showed
> up...
> 
> I was seeing some bad disk performance on a new install of Fedora Core
> 6, so I did some measurements of write speed, and it would appear that
> write performance is so slow it can't write my data as fast as it is
> generated :-(

What drive configuration are you using (SCSI / ATA / SATA), what chipset is
providing the disk interface and what cpu are you running with?

Thanks,

RL



* Re: Odd (slow) RAID performance
  2006-11-30 14:31 ` Roger Lucas
@ 2006-11-30 15:30   ` Bill Davidsen
  2006-11-30 15:32     ` Roger Lucas
  0 siblings, 1 reply; 20+ messages in thread
From: Bill Davidsen @ 2006-11-30 15:30 UTC (permalink / raw)
  To: Roger Lucas; +Cc: linux-raid

Roger Lucas wrote:
>> -----Original Message-----
>> From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
>> owner@vger.kernel.org] On Behalf Of Bill Davidsen
>> Sent: 30 November 2006 14:13
>> To: linux-raid@vger.kernel.org
>> Subject: Odd (slow) RAID performance
>>
>> Pardon if you see this twice, I sent it last night and it never showed
>> up...
>>
>> I was seeing some bad disk performance on a new install of Fedora Core
>> 6, so I did some measurements of write speed, and it would appear that
>> write performance is so slow it can't write my data as fast as it is
>> generated :-(
>>     
>
> What drive configuration are you using (SCSI / ATA / SATA), what chipset is
> providing the disk interface and what cpu are you running with?
3xSATA, Seagate 320 ST3320620AS, Intel 6600, ICH7 controller using the 
ata-piix driver, with drive cache set to write-back. It's not obvious to 
me why that matters, but if it helps you see the problem I'm glad to 
provide the info. I'm seeing ~50MB/s on the raw drive, and 3x that on 
plain stripes, so I'm assuming that either the RAID-5 code is not 
working well or I haven't set it up optimally.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979



* RE: Odd (slow) RAID performance
  2006-11-30 15:30   ` Bill Davidsen
@ 2006-11-30 15:32     ` Roger Lucas
  2006-11-30 21:09       ` Bill Davidsen
  0 siblings, 1 reply; 20+ messages in thread
From: Roger Lucas @ 2006-11-30 15:32 UTC (permalink / raw)
  To: 'Bill Davidsen'; +Cc: linux-raid

> > What drive configuration are you using (SCSI / ATA / SATA), what chipset
> is
> > providing the disk interface and what cpu are you running with?
> 3xSATA, Seagate 320 ST3320620AS, Intel 6600, ICH7 controller using the
> ata-piix driver, with drive cache set to write-back. It's not obvious to
> me why that matters, but if it helps you see the problem I''m glad to
> provide the info. I'm seeing ~50MB/s on the raw drive, and 3x that on
> plain stripes, so I'm assuming that either the RAID-5 code is not
> working well or I haven't set it up optimally.

If it had been ATA, and you had two drives as master+slave on the same
cable, then they would be fast individually but slow as a pair.

RAID-5 is higher overhead than RAID-0/RAID-1 so if your CPU was slow then
you would see some degradation from that too.

We have similar hardware here so I'll run some tests here and see what I
get...



* Re: Odd (slow) RAID performance
  2006-11-30 15:32     ` Roger Lucas
@ 2006-11-30 21:09       ` Bill Davidsen
  2006-12-01  9:24         ` Roger Lucas
  0 siblings, 1 reply; 20+ messages in thread
From: Bill Davidsen @ 2006-11-30 21:09 UTC (permalink / raw)
  To: Roger Lucas; +Cc: linux-raid

Roger Lucas wrote:
>>> What drive configuration are you using (SCSI / ATA / SATA), what chipset
>>>       
>> is
>>     
>>> providing the disk interface and what cpu are you running with?
>>>       
>> 3xSATA, Seagate 320 ST3320620AS, Intel 6600, ICH7 controller using the
>> ata-piix driver, with drive cache set to write-back. It's not obvious to
>> me why that matters, but if it helps you see the problem I''m glad to
>> provide the info. I'm seeing ~50MB/s on the raw drive, and 3x that on
>> plain stripes, so I'm assuming that either the RAID-5 code is not
>> working well or I haven't set it up optimally.
>>     
>
> If it had been ATA, and you had two drives as master+slave on the same
> cable, then they would be fast individually but slow as a pair.
>
> RAID-5 is higher overhead than RAID-0/RAID-1 so if your CPU was slow then
> you would see some degradation from that too.
>
> We have similar hardware here so I'll run some tests here and see what I
> get...

Much appreciated. Since my last note I tried adding --bitmap=internal to 
the array. Boy, is that a write performance killer. I will have the chart 
updated in a minute, but write dropped to ~15MB/s with the bitmap. Since 
Fedora can't seem to shut the last array down cleanly, I get a rebuild 
on every boot :-( So the array for the LVM has the bitmap on, as I hate 
to rebuild 1.5TB regularly. Have to make some compromises on that!
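
(For anyone who wants to reproduce this, adding and removing the bitmap 
on an existing, assembled array should be something like:

	mdadm --grow --bitmap=internal /dev/mdX
	mdadm --grow --bitmap=none /dev/mdX

device name illustrative; that's how I read the mdadm --grow options, 
anyway.)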

Thanks for looking!

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979



* RE: Odd (slow) RAID performance
  2006-11-30 21:09       ` Bill Davidsen
@ 2006-12-01  9:24         ` Roger Lucas
  2006-12-02  5:27           ` Bill Davidsen
  0 siblings, 1 reply; 20+ messages in thread
From: Roger Lucas @ 2006-12-01  9:24 UTC (permalink / raw)
  To: 'Bill Davidsen'; +Cc: linux-raid

> Roger Lucas wrote:
> >>> What drive configuration are you using (SCSI / ATA / SATA), what
> chipset
> >>>
> >> is
> >>
> >>> providing the disk interface and what cpu are you running with?
> >>>
> >> 3xSATA, Seagate 320 ST3320620AS, Intel 6600, ICH7 controller using the
> >> ata-piix driver, with drive cache set to write-back. It's not obvious
> to
> >> me why that matters, but if it helps you see the problem I''m glad to
> >> provide the info. I'm seeing ~50MB/s on the raw drive, and 3x that on
> >> plain stripes, so I'm assuming that either the RAID-5 code is not
> >> working well or I haven't set it up optimally.
> >>
> >
> > If it had been ATA, and you had two drives as master+slave on the same
> > cable, then they would be fast individually but slow as a pair.
> >
> > RAID-5 is higher overhead than RAID-0/RAID-1 so if your CPU was slow
> then
> > you would see some degradation from that too.
> >
> > We have similar hardware here so I'll run some tests here and see what I
> > get...
> 
> Much appreciated. Since my last note I tried adding --bitmap=internal to
> the array. Bot is that a write performance killer. I will have the chart
> updated in a minute, but write dropped to ~15MB/s with bitmap. Since
> Fedora can't seem to shut the last array down cleanly, I get a rebuild
> on every boot :-( So the array for the LVM has bitmap on, as I hate to
> rebuild 1.5TB regularly. Have to do some compromises on that!
> 

Hi Bill,

Here are the results of my tests here:

	CPU: Intel Celeron 2.7GHz socket 775
	MB:  Abit LG-81 (Lakeport ICH7 chipset)
	HDD: 4 x Seagate SATA ST3160812AS (directly connected to ICH7)
	OS:  Linux 2.6.16-xen

root@hydra:~# uname -a
Linux hydra 2.6.16-xen #1 SMP Thu Apr 13 18:46:07 BST 2006 i686 GNU/Linux
root@hydra:~#

All four disks are built into a RAID-5 array to provide ~420GB real storage.
Most of this is then used by the other Xen virtual machines but there is a
bit of space left on this server to play with in the Dom-0.

I wasn't able to run I/O tests with "dd" on the disks themselves as I don't
have a spare partition to corrupt, but hdparm gives:

root@hydra:~# hdparm -tT /dev/sda

/dev/sda:
 Timing cached reads:   3296 MB in  2.00 seconds = 1648.48 MB/sec
 Timing buffered disk reads:  180 MB in  3.01 seconds =  59.78 MB/sec
root@hydra:~#

Which is exactly what I would expect as this is the performance limit of the
disk.  We have a lot of ICH7/ICH7R-based servers here and all can run the
disk at their maximum physical speed without problems.

root@hydra:~# cat /proc/mdstat
Personalities : [raid5] [raid4]
md0 : active raid5 sda2[0] sdd2[3] sdc2[2] sdb2[1]
      468647808 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

unused devices: <none>
root@hydra:~# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/bigraid-root
                       10G  1.3G  8.8G  13% /
<snip>
root@hydra:~# vgs
  VG      #PV #LV #SN Attr   VSize   VFree
  bigraid   1  13   0 wz--n- 446.93G 11.31G
root@hydra:~# lvcreate --name testspeed --size 2G bigraid
  Logical volume "testspeed" created
root@hydra:~#

*** Now for the LVM over RAID-5 read/write tests ***

root@hydra:~# sync; time bash -c "dd if=/dev/zero bs=1024k count=2048
of=/dev/bigraid/testspeed; sync"
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 33.7345 seconds, 63.7 MB/s

real    0m34.211s
user    0m0.020s
sys     0m2.970s
root@hydra:~# sync; time bash -c "dd of=/dev/zero bs=1024k count=2048
if=/dev/bigraid/testspeed; sync"
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 38.1175 seconds, 56.3 MB/s

real    0m38.637s
user    0m0.010s
sys     0m3.260s
root@hydra:~#

During the above two tests, the CPU showed about 35% idle using "top".

*** Now for the file system read/write tests ***
   (Reiser over LVM over RAID-5)

root@hydra:~# mount
/dev/mapper/bigraid-root on / type reiserfs (rw)
<snip>
root@hydra:~#


root@hydra:~# sync; time bash -c "dd if=/dev/zero bs=1024k count=2048
of=~/testspeed; sync"
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 29.8863 seconds, 71.9 MB/s

real    0m32.289s
user    0m0.000s
sys     0m4.440s
root@hydra:~# sync; time bash -c "dd of=/dev/null bs=1024k count=2048
if=~/testspeed; sync"
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 40.332 seconds, 53.2 MB/s

real    0m40.973s
user    0m0.010s
sys     0m2.640s
root@hydra:~#

During the above two tests, the CPU showed between 0% and 30% idle using
"top".

Just out of curiosity, I started the RAID-5 check process to see what load it
generated...

root@hydra:~# cat /sys/block/md0/md/mismatch_cnt
0
root@hydra:~# echo check > /sys/block/md0/md/sync_action
root@hydra:~# cat /sys/block/md0/md/sync_action
check
root@hydra:~# cat /proc/mdstat
Personalities : [raid5] [raid4]
md0 : active raid5 sda2[0] sdd2[3] sdc2[2] sdb2[1]
      468647808 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
      [>....................]  resync =  1.0% (1671552/156215936)
finish=101.8min speed=25292K/sec

unused devices: <none>
root@hydra:~#

Whilst the above test was running, the CPU load was between 3% and 7%, so
running the RAID array isn't that hard for it...

-------------------------

So, using a 4-disk RAID-5 array with an ICH7, I get about 64 MB/s write and
54 MB/s read performance.  The processor is about 35% idle whilst the test is
running - I'm not sure why this is; I would have expected the processor load
to be 0% idle, as it should be hitting the hard disk as fast as possible and
waiting for it otherwise....

If I run over Reiser, the processor load changes a lot more, varying between
0% and 35% idle.  It also takes a couple of seconds after the test has
finished before the load drops down to zero on the write test, so I suspect
these results are basically the same as the raw LVM-over-RAID5 performance.

Summary - it is a little faster with four disks than the 37.5 MB/s that you
have with just the three, but it is WAY off the theoretical target of
3 x 60 MB/s = 180 MB/s that could be expected from a 4-disk RAID-5 array.
 
On the flip side, the performance is good enough for me, so it is not
causing me a problem, but it seems that there should be a performance boost
available somewhere!

Best regards,

Roger



* Re: Odd (slow) RAID performance
  2006-12-01  9:24         ` Roger Lucas
@ 2006-12-02  5:27           ` Bill Davidsen
  2006-12-05  1:33             ` Dan Williams
  0 siblings, 1 reply; 20+ messages in thread
From: Bill Davidsen @ 2006-12-02  5:27 UTC (permalink / raw)
  To: Roger Lucas; +Cc: linux-raid, neilb

Roger Lucas wrote:
>> Roger Lucas wrote:
>>>>> What drive configuration are you using (SCSI / ATA / SATA), what
>> chipset
>>>> is
>>>>
>>>>> providing the disk interface and what cpu are you running with?
>>>>>
>>>> 3xSATA, Seagate 320 ST3320620AS, Intel 6600, ICH7 controller using the
>>>> ata-piix driver, with drive cache set to write-back. It's not obvious
>> to
>>>> me why that matters, but if it helps you see the problem I''m glad to
>>>> provide the info. I'm seeing ~50MB/s on the raw drive, and 3x that on
>>>> plain stripes, so I'm assuming that either the RAID-5 code is not
>>>> working well or I haven't set it up optimally.
>>>>
>>> If it had been ATA, and you had two drives as master+slave on the same
>>> cable, then they would be fast individually but slow as a pair.
>>>
>>> RAID-5 is higher overhead than RAID-0/RAID-1 so if your CPU was slow
>> then
>>> you would see some degradation from that too.
>>>
>>> We have similar hardware here so I'll run some tests here and see what I
>>> get...
>> Much appreciated. Since my last note I tried adding --bitmap=internal to
>> the array. Bot is that a write performance killer. I will have the chart
>> updated in a minute, but write dropped to ~15MB/s with bitmap. Since
>> Fedora can't seem to shut the last array down cleanly, I get a rebuild
>> on every boot :-( So the array for the LVM has bitmap on, as I hate to
>> rebuild 1.5TB regularly. Have to do some compromises on that!
>>
> 
> Hi Bill,
> 
> Here are the results of my tests here:
> 
> 	CPU: Intel Celetron 2.7GHz socket 775
> 	MB:  Abit LG-81 (Lakeport ICH7 chipset)
> 	HDD: 4 x Seagate SATA ST3160812AS (directly connected to ICH7)
> 	OS:  Linux 2.6.16-xen
> 
> root@hydra:~# uname -a
> Linux hydra 2.6.16-xen #1 SMP Thu Apr 13 18:46:07 BST 2006 i686 GNU/Linux
> root@hydra:~#
> 
> All four disks are built into a RAID-5 array to provide ~420GB real storage.
> Most of this is then used by the other Xen virtual machines but there is a
> bit of space left on this server to play with in the Dom-0.
> 
> I wasn't able to run I/O tests with "dd" on the disks themselves as I don't
> have a spare partition to corrupt, but hdparm gives:
> 
> root@hydra:~# hdparm -tT /dev/sda
> 
> /dev/sda:
>  Timing cached reads:   3296 MB in  2.00 seconds = 1648.48 MB/sec
>  Timing buffered disk reads:  180 MB in  3.01 seconds =  59.78 MB/sec
> root@hydra:~#
> 
> Which is exactly what I would expect as this is the performance limit of the
> disk.  We have a lot of ICH7/ICH7R-based servers here and all can run the
> disk at their maximum physical speed without problems.
> 
> root@hydra:~# cat /proc/mdstat
> Personalities : [raid5] [raid4]
> md0 : active raid5 sda2[0] sdd2[3] sdc2[2] sdb2[1]
>       468647808 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
> 
> unused devices: <none>
> root@hydra:~# df -h
> Filesystem            Size  Used Avail Use% Mounted on
> /dev/mapper/bigraid-root
>                        10G  1.3G  8.8G  13% /
> <snip>
> root@hydra:~# vgs
>   VG      #PV #LV #SN Attr   VSize   VFree
>   bigraid   1  13   0 wz--n- 446.93G 11.31G
> root@hydra:~# lvcreate --name testspeed --size 2G bigraid
>   Logical volume "testspeed" created
> root@hydra:~#
> 
> *** Now for the LVM over RAID-5 read/write tests ***
> 
> root@hydra:~# sync; time bash -c "dd if=/dev/zero bs=1024k count=2048
> of=/dev/bigraid/testspeed; sync"
> 2048+0 records in
> 2048+0 records out
> 2147483648 bytes (2.1 GB) copied, 33.7345 seconds, 63.7 MB/s
> 
> real    0m34.211s
> user    0m0.020s
> sys     0m2.970s
> root@hydra:~# sync; time bash -c "dd of=/dev/zero bs=1024k count=2048
> if=/dev/bigraid/testspeed; sync"
> 2048+0 records in
> 2048+0 records out
> 2147483648 bytes (2.1 GB) copied, 38.1175 seconds, 56.3 MB/s
> 
> real    0m38.637s
> user    0m0.010s
> sys     0m3.260s
> root@hydra:~#
> 
> During the above two tests, the CPU showed about 35% idle using "top".
> 
> *** Now for the file system read/write tests ***
>    (Reiser over LVM over RAID-5)
> 
> root@hydra:~# mount
> /dev/mapper/bigraid-root on / type reiserfs (rw)
> <snip>
> root@hydra:~#
> 
> 
> root@hydra:~# sync; time bash -c "dd if=/dev/zero bs=1024k count=2048
> of=~/testspeed; sync"
> 2048+0 records in
> 2048+0 records out
> 2147483648 bytes (2.1 GB) copied, 29.8863 seconds, 71.9 MB/s
> 
> real    0m32.289s
> user    0m0.000s
> sys     0m4.440s
> root@hydra:~# sync; time bash -c "dd of=/dev/null bs=1024k count=2048
> if=~/testspeed; sync"
> 2048+0 records in
> 2048+0 records out
> 2147483648 bytes (2.1 GB) copied, 40.332 seconds, 53.2 MB/s
> 
> real    0m40.973s
> user    0m0.010s
> sys     0m2.640s
> root@hydra:~#
> 
> During the above two tests, the CPU showed between 0% and 30% idle using
> "top".
> 
> Just for curiousity, I started the RAID-5 check process to see what load it
> generated...
> 
> root@hydra:~# cat /sys/block/md0/md/mismatch_cnt
> 0
> root@hydra:~# echo check > /sys/block/md0/md/sync_action
> root@hydra:~# cat /sys/block/md0/md/sync_action
> check
> root@hydra:~# cat /proc/mdstat
> Personalities : [raid5] [raid4]
> md0 : active raid5 sda2[0] sdd2[3] sdc2[2] sdb2[1]
>       468647808 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
>       [>....................]  resync =  1.0% (1671552/156215936)
> finish=101.8min speed=25292K/sec
> 
> unused devices: <none>
> root@hydra:~#
> 
> Whilst the above test was running, the CPU load was between 3% and 7%, so
> running the RAID array isn't that hard for it...
> 
> -------------------------
> 
> So, using a 4-disk RAID-5 array with an ICH7, I get about 64M write and 54MB
> read prformance.  The processor is about 35% idle whilst the test is running
> - I'm not sure why this is, I would have expected the processor load to be
> 0% idle as it should be hitting the hard disk as fast as possible and
> waiting for it otherwise....
> 
> If I run over Reiser, the processor load changes a lot more, varying between
> 0% and 35% idle.  It also takes a couple of seconds after the test has
> finished before the load drops down to zero on the write test, so I suspect
> these results are basically the same as the raw LVM-over-RAID5 performance.
> 
> Summary - it is a little faster with 4 disks rather than the 37.5 MB/s that
> you have with just the three, but it is WAY off the theoretical target of
> 3x60MB = 180MB that could be expected given that you are running a 4-disk
> RAID-5 array.
>  
> On the flip side, the performance is good enough for me, so it is not
> causing me a problem, but it seems that there should be a performance boost
> available somewhere!
> 
> Best regards,
> 
> Roger

Thank you so much for verifying this. I do keep enough room on my drives 
to run tests by creating any kind of array I need, but the point is 
clear: with N drives striped the transfer rate is N x the base rate of 
one drive; with RAID-5 it is about the speed of one drive, suggesting 
that the md code serializes writes.

If true, BOO, HISS!

Can you explain and educate us, Neil? This looks like terrible performance.

-- 
Bill Davidsen
   He was a full-time professional cat, not some moonlighting
ferret or weasel. He knew about these things.


* Re: Odd (slow) RAID performance
  2006-12-02  5:27           ` Bill Davidsen
@ 2006-12-05  1:33             ` Dan Williams
  2006-12-07 15:51               ` Bill Davidsen
  2006-12-08  6:01               ` Neil Brown
  0 siblings, 2 replies; 20+ messages in thread
From: Dan Williams @ 2006-12-05  1:33 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: Roger Lucas, linux-raid, neilb

On 12/1/06, Bill Davidsen <davidsen@tmr.com> wrote:
> Thank you so much for verifying this. I do keep enough room on my drives
> to run tests by creating any kind of whatever I need, but the point is
> clear: with N drives striped the transfer rate is N x base rate of one
> drive; with RAID-5 it is about the speed of one drive, suggesting that
> the md code serializes writes.
>
> If true, BOO, HISS!
>
> Can you explain and educate us, Neal? This look like terrible performance.
>
Just curious what is your stripe_cache_size setting in sysfs?
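
Something like this should show it, and let you experiment (md device 
name is just an example; the value is per-array):

	cat /sys/block/md0/md/stripe_cache_size
	echo 4096 > /sys/block/md0/md/stripe_cache_size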

Neil, please include me in the education if what follows is incorrect:

Read performance in kernels up to and including 2.6.19 is hindered by
needing to go through the stripe cache.  This situation should improve
with the stripe-cache-bypass patches currently in -mm.  As Raz
reported in some cases the performance increase of this approach is
30% which is roughly equivalent to the performance difference I see of
a 4-disk raid5 versus a 3-disk raid0.

For the write case I can say that MD does not serialize writes, if by
serialize you mean that there is a 1:1 correlation between writes to the
parity disk and writes to a data disk.  To illustrate, I instrumented
MD to count how many times it issued a write to the parity disk and
compared that to how many writes it performed to the member disks for
the workload "dd if=/dev/zero of=/dev/md0 bs=1024k count=100".  I
recorded 8544 parity writes and 25600 member disk writes, which is
about 3 member disk writes per parity write, or pretty close to
optimal for a 4-disk array.  So serialization is not the cause, and
performing sub-stripe-width writes is not the cause either, as >98% of
the writes happened without needing to read old data from the disks.
However, I see the same performance on my system, about equal to a
single disk.

Here is where I step into supposition territory.  Perhaps the
discrepancy is related to the size of the requests going to the block
layer.  raid5 always makes page sized requests with the expectation
that they will coalesce into larger requests in the block layer.
Maybe we are missing coalescing opportunities in raid5 compared to
what happens in the raid0 case?  Are there any io scheduler knobs to
turn along these lines?

Dan


* Re: Odd (slow) RAID performance
  2006-12-05  1:33             ` Dan Williams
@ 2006-12-07 15:51               ` Bill Davidsen
  2006-12-08  1:15                 ` Corey Hickey
  2006-12-08  8:21                 ` Gabor Gombas
  2006-12-08  6:01               ` Neil Brown
  1 sibling, 2 replies; 20+ messages in thread
From: Bill Davidsen @ 2006-12-07 15:51 UTC (permalink / raw)
  To: Dan Williams; +Cc: Roger Lucas, linux-raid, neilb

Dan Williams wrote:
> On 12/1/06, Bill Davidsen <davidsen@tmr.com> wrote:
>> Thank you so much for verifying this. I do keep enough room on my drives
>> to run tests by creating any kind of whatever I need, but the point is
>> clear: with N drives striped the transfer rate is N x base rate of one
>> drive; with RAID-5 it is about the speed of one drive, suggesting that
>> the md code serializes writes.
>>
>> If true, BOO, HISS!
>>
>> Can you explain and educate us, Neal? This look like terrible 
>> performance.
>>
> Just curious what is your stripe_cache_size setting in sysfs?
> 
> Neil, please include me in the education if what follows is incorrect:
> 
> Read performance in kernels up to and including 2.6.19 is hindered by
> needing to go through the stripe cache.  This situation should improve
> with the stripe-cache-bypass patches currently in -mm.  As Raz
> reported in some cases the performance increase of this approach is
> 30% which is roughly equivalent to the performance difference I see of
> a 4-disk raid5 versus a 3-disk raid0.
> 
> For the write case I can say that MD does not serialize writes.  If by
> serialize you mean that there is 1:1 correlation between writes to the
> parity disk and writes to a data disk.  To illustrate I instrumented
> MD to count how many times it issued a write to the parity disk and
> compared that to how many writes it performed to the member disks for
> the workload "dd if=/dev/zero of=/dev/md0 bs=1024k count=100".  I
> recorded 8544 parity writes and 25600 member disk writes which is
> about 3 member disk writes per parity write, or pretty close to
> optimal for a 4-disk array.  So, serialization is not the cause,
> performing sub-stripe width writes is not the cause as >98% of the
> writes happened without needing to read old data from the disks.
> However, I see the same performance on my system, about equal to a
> single disk.

But the number of writes isn't an indication of serialization. If I 
write disk A, then B, then C, then D, you can't tell if I waited for 
each write to finish before starting the next, or did them in parallel. 
And since the write speed is equal to the speed of a single drive, 
effectively that's what happens, even though I can't see it in the code.

I also suspect that writes are not being combined, since writing the 2GB 
test runs at one-drive speed with 1MB blocks, but at floppy speed with 
2k blocks. And no, I'm not running out of CPU to do the overhead; it 
jumps from 2-4% to 30% of one CPU, but on an unloaded SMP system it's 
not CPU bound.
> 
> Here is where I step into supposition territory.  Perhaps the
> discrepancy is related to the size of the requests going to the block
> layer.  raid5 always makes page sized requests with the expectation
> that they will coalesce into larger requests in the block layer.
> Maybe we are missing coalescing opportunities in raid5 compared to
> what happens in the raid0 case?  Are there any io scheduler knobs to
> turn along these lines?

Good thought; I had already tried that but not reported it. Changing 
schedulers makes no significant difference, only in the range of 2-3%, 
which is close to the measurement jitter due to head position or whatever.

I changed my swap to RAID-10, but RAID-5 just can't keep up with 
70-100MB/s data bursts which I need. I'm probably going to scrap 
software RAID and go back to a controller, the write speeds are simply 
not even close to what they should be. I have one more thing to try, a 
tool I wrote to chase another problem a few years ago. I'll report if I 
find something.

-- 
bill davidsen <davidsen@tmr.com>
   CTO TMR Associates, Inc
   Doing interesting things with small computers since 1979


* Re: Odd (slow) RAID performance
  2006-12-07 15:51               ` Bill Davidsen
@ 2006-12-08  1:15                 ` Corey Hickey
  2006-12-08  8:21                 ` Gabor Gombas
  1 sibling, 0 replies; 20+ messages in thread
From: Corey Hickey @ 2006-12-08  1:15 UTC (permalink / raw)
  To: linux-raid

Bill Davidsen wrote:
> Dan Williams wrote:
>> On 12/1/06, Bill Davidsen <davidsen@tmr.com> wrote:
>>> Thank you so much for verifying this. I do keep enough room on my drives
>>> to run tests by creating any kind of whatever I need, but the point is
>>> clear: with N drives striped the transfer rate is N x base rate of one
>>> drive; with RAID-5 it is about the speed of one drive, suggesting that
>>> the md code serializes writes.
>>>
>>> If true, BOO, HISS!
>>>
>>> Can you explain and educate us, Neal? This look like terrible 
>>> performance.
>>>
>> Just curious what is your stripe_cache_size setting in sysfs?
>>
>> Neil, please include me in the education if what follows is incorrect:
>>
>> Read performance in kernels up to and including 2.6.19 is hindered by
>> needing to go through the stripe cache.  This situation should improve
>> with the stripe-cache-bypass patches currently in -mm.  As Raz
>> reported in some cases the performance increase of this approach is
>> 30% which is roughly equivalent to the performance difference I see of
>> a 4-disk raid5 versus a 3-disk raid0.
>>
>> For the write case I can say that MD does not serialize writes.  If by
>> serialize you mean that there is 1:1 correlation between writes to the
>> parity disk and writes to a data disk.  To illustrate I instrumented
>> MD to count how many times it issued a write to the parity disk and
>> compared that to how many writes it performed to the member disks for
>> the workload "dd if=/dev/zero of=/dev/md0 bs=1024k count=100".  I
>> recorded 8544 parity writes and 25600 member disk writes which is
>> about 3 member disk writes per parity write, or pretty close to
>> optimal for a 4-disk array.  So, serialization is not the cause,
>> performing sub-stripe width writes is not the cause as >98% of the
>> writes happened without needing to read old data from the disks.
>> However, I see the same performance on my system, about equal to a
>> single disk.
> 
> But the number of writes isn't an indication of serialization. If I 
> write disk A, then B, then C, then D, you can't tell if I waited for 
> each write to finish before starting the next, or did them in parallel. 
> And since the write speed is equal to the speed of a single drive, 
> effectively that's what happens, even though I can't see it in the code.

For what it's worth, my read and write speeds on a 5-disk RAID-5 are 
somewhat faster than the speed of any single drive. The array is a 
mixture of two different models of SATA drive and one IDE drive.

Sustained individual read performances range from 56 MB/sec for the IDE 
drive to 68 MB/sec for the faster SATA drives. I can read from the 
RAID-5 at about 100MB/sec.

I can't give precise numbers for write speeds, except to say that I can 
write to a file on the filesystem (which is mostly full and probably 
somewhat fragmented) at about 83 MB/sec.

None of those numbers are equal to the theoretical maximum performance, 
so I see your point, but they're still faster than one individual disk.

> I also suspect that write are not being combined, since writing the 2GB 
> test runs at one-drive speed writing 1MB blocks, but floppy speed 
> writing 2k blocks. And no, I'm not running out of CPU to do the 
> overhead, it jumps from 2-4% to 30% of one CPU, but on an unloaded SMP 
> system it's not CPU bound.
>>
>> Here is where I step into supposition territory.  Perhaps the
>> discrepancy is related to the size of the requests going to the block
>> layer.  raid5 always makes page sized requests with the expectation
>> that they will coalesce into larger requests in the block layer.
>> Maybe we are missing coalescing opportunities in raid5 compared to
>> what happens in the raid0 case?  Are there any io scheduler knobs to
>> turn along these lines?
> 
> Good thought, I had already tried that but not reported it, changing 
> schedulers make no significant difference. In the range of 2-3%, which 
> is close to the measurement jitter due to head position or whatever.
> 
> I changed my swap to RAID-10, but RAID-5 just can't keep up with 
> 70-100MB/s data bursts which I need. I'm probably going to scrap 
> software RAID and go back to a controller, the write speeds are simply 
> not even close to what they should be. I have one more thing to try, a 
> tool I wrote to chase another problem a few years ago. I'll report if I 
> find something.

I have read that using RAID to stripe swap space is ill-advised, or at 
least unnecessary. The kernel will stripe multiple swap devices if you 
assign them the same priority.
http://tldp.org/HOWTO/Software-RAID-HOWTO-2.html

If you've been using RAID-10 for swap, then I think you could just 
assign multiple RAID-1 devices the same swap priority for the same 
effect with (perhaps) less overhead.
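
For example (device names illustrative), two RAID-1 swap devices given 
equal priority in /etc/fstab will be striped by the kernel:

	/dev/md1   none   swap   sw,pri=1   0   0
	/dev/md2   none   swap   sw,pri=1   0   0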

-Corey


* Re: Odd (slow) RAID performance
  2006-12-05  1:33             ` Dan Williams
  2006-12-07 15:51               ` Bill Davidsen
@ 2006-12-08  6:01               ` Neil Brown
  2006-12-08  7:28                 ` Neil Brown
  2006-12-09 20:16                 ` Bill Davidsen
  1 sibling, 2 replies; 20+ messages in thread
From: Neil Brown @ 2006-12-08  6:01 UTC (permalink / raw)
  To: Dan Williams; +Cc: Bill Davidsen, Roger Lucas, linux-raid

On Monday December 4, dan.j.williams@gmail.com wrote:
> 
> Here is where I step into supposition territory.  Perhaps the
> discrepancy is related to the size of the requests going to the block
> layer.  raid5 always makes page sized requests with the expectation
> that they will coalesce into larger requests in the block layer.
> Maybe we are missing coalescing opportunities in raid5 compared to
> what happens in the raid0 case?  Are there any io scheduler knobs to
> turn along these lines?

This can be measured.  /proc/diskstats reports the number of requests
as well as the number of sectors.
The number of write requests is column 8.  The number of write sectors
is column 10.  Comparing these you can get an average request size.
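
For example, something along these lines (member-disk name illustrative)
prints the average write request size, in sectors, seen by one disk:

	awk '$3 == "sda" { printf "%.1f sectors/write\n", $10 / $8 }' /proc/diskstats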

I have found that the average request size is proportional to the size
of the stripe cache (roughly, with limits) but increasing it doesn't
increase throughput.
I have measured very slow write throughput for raid5 as well, though
2.6.18 does seem to have the same problem.  I'll double check and do a
git bisect and see what I can come up with.

NeilBrown


* Re: Odd (slow) RAID performance
  2006-12-08  6:01               ` Neil Brown
@ 2006-12-08  7:28                 ` Neil Brown
  2006-12-09 20:20                   ` Bill Davidsen
  2006-12-12 17:44                   ` Bill Davidsen
  2006-12-09 20:16                 ` Bill Davidsen
  1 sibling, 2 replies; 20+ messages in thread
From: Neil Brown @ 2006-12-08  7:28 UTC (permalink / raw)
  To: Dan Williams, Bill Davidsen, Roger Lucas, linux-raid

On Friday December 8, neilb@suse.de wrote:
> I have measured very slow write throughput for raid5 as well, though
> 2.6.18 does seem to have the same problem.  I'll double check and do a
> git bisect and see what I can come up with.

Correction... it isn't 2.6.18 that fixes the problem.  It is compiling
without LOCKDEP or PROVE_LOCKING.  I remove those and suddenly a
3 drive raid5 is faster than a single drive rather than much slower.

Bill:  Do you have LOCKDEP or PROVE_LOCKING enabled in your .config ??
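
(On a distro kernel this is usually visible without rebuilding, e.g.

	grep -E 'CONFIG_LOCKDEP|CONFIG_PROVE_LOCKING' /boot/config-$(uname -r)

assuming Fedora installs the config file alongside the kernel, as most
distros do.)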

NeilBrown


* Re: Odd (slow) RAID performance
  2006-12-07 15:51               ` Bill Davidsen
  2006-12-08  1:15                 ` Corey Hickey
@ 2006-12-08  8:21                 ` Gabor Gombas
  1 sibling, 0 replies; 20+ messages in thread
From: Gabor Gombas @ 2006-12-08  8:21 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: Dan Williams, Roger Lucas, linux-raid, neilb

On Thu, Dec 07, 2006 at 10:51:25AM -0500, Bill Davidsen wrote:

> I also suspect that write are not being combined, since writing the 2GB 
> test runs at one-drive speed writing 1MB blocks, but floppy speed 
> writing 2k blocks. And no, I'm not running out of CPU to do the 
> overhead, it jumps from 2-4% to 30% of one CPU, but on an unloaded SMP 
> system it's not CPU bound.

You could use blktrace to see the actual requests that the md code sends
down to the device, including request merging actions. That may provide
more insight into what really happens.
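
For example (member-disk name and duration illustrative), run this while
the dd test is going:

	blktrace -d /dev/sda -w 30 -o - | blkparse -i -

which shows the size of each request md sends to that disk, with merges
showing up as M/F actions in the blkparse output.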

Gabor

-- 
     ---------------------------------------------------------
     MTA SZTAKI Computer and Automation Research Institute
                Hungarian Academy of Sciences
     ---------------------------------------------------------


* Re: Odd (slow) RAID performance
  2006-12-08  6:01               ` Neil Brown
  2006-12-08  7:28                 ` Neil Brown
@ 2006-12-09 20:16                 ` Bill Davidsen
  1 sibling, 0 replies; 20+ messages in thread
From: Bill Davidsen @ 2006-12-09 20:16 UTC (permalink / raw)
  To: Neil Brown; +Cc: Dan Williams, Roger Lucas, linux-raid

Neil Brown wrote:
> On Monday December 4, dan.j.williams@gmail.com wrote:
>   
>> Here is where I step into supposition territory.  Perhaps the
>> discrepancy is related to the size of the requests going to the block
>> layer.  raid5 always makes page sized requests with the expectation
>> that they will coalesce into larger requests in the block layer.
>> Maybe we are missing coalescing opportunities in raid5 compared to
>> what happens in the raid0 case?  Are there any io scheduler knobs to
>> turn along these lines?
>>     
>
> This can be measured.  /proc/diskstats reports the number of requests
> as well as the number of sectors.
> The number of write requests is column 8.  The number of write sectors
> is column 10.  Comparing these you can get an average request size.
>
> I have found that the average request size is proportional to the size
> of the stripe cache (roughly, with limits) but increasing it doesn't
> increase through put.
> I have measured very slow write throughput for raid5 as well, though
> 2.6.18 does seem to have the same problem.  I'll double check and do a
> git bisect and see what I can come up with.
>
> NeilBrown
Agreed, this is an ongoing problem, not a regression in 2.6.19. But I am 
writing 50MB/s to a single drive, 3x that to a three-way RAID-0 array of 
those drives, and only 35MB/s to a three-drive RAID-5 array. With large 
writes I know no reread is needed, and yet I get consistently slow 
writes, which get worse with smaller data writes (2k vs. 1MB for the 
original test).

Read performance is good; I will measure tomorrow and quantify "good." 
Today is shot from ten minutes from now until ~2am, as I have a party to 
attend, followed by a 'cast to watch.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979



* Re: Odd (slow) RAID performance
  2006-12-08  7:28                 ` Neil Brown
@ 2006-12-09 20:20                   ` Bill Davidsen
  2006-12-12 17:44                   ` Bill Davidsen
  1 sibling, 0 replies; 20+ messages in thread
From: Bill Davidsen @ 2006-12-09 20:20 UTC (permalink / raw)
  To: Neil Brown; +Cc: Dan Williams, Roger Lucas, linux-raid

Neil Brown wrote:
> On Friday December 8, neilb@suse.de wrote:
>   
>> I have measured very slow write throughput for raid5 as well, though
>> 2.6.18 does seem to have the same problem.  I'll double check and do a
>> git bisect and see what I can come up with.
>>     
>
> Correction... it isn't 2.6.18 that fixes the problem.  It is compiling
> without LOCKDEP or PROVE_LOCKING.  I remove those and suddenly a
> 3 drive raid5 is faster than a single drive rather than much slower.
>
> Bill:  Do you have LOCKDEP or PROVE_LOCKING enabled in your .config ??
>   
I have to check tomorrow; I'm using the Fedora kernel (as noted in the 
first post on this) rather than one I built, just so others could verify 
my results, as several have been kind enough to do. Have to run, but I 
will check tomorrow or early Monday morning at the latest.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979



* Re: Odd (slow) RAID performance
  2006-12-08  7:28                 ` Neil Brown
  2006-12-09 20:20                   ` Bill Davidsen
@ 2006-12-12 17:44                   ` Bill Davidsen
  2006-12-12 18:48                     ` Raz Ben-Jehuda(caro)
  1 sibling, 1 reply; 20+ messages in thread
From: Bill Davidsen @ 2006-12-12 17:44 UTC (permalink / raw)
  To: Neil Brown; +Cc: Dan Williams, Roger Lucas, linux-raid

Neil Brown wrote:
> On Friday December 8, neilb@suse.de wrote:
>   
>> I have measured very slow write throughput for raid5 as well, though
>> 2.6.18 does seem to have the same problem.  I'll double check and do a
>> git bisect and see what I can come up with.
>>     
>
> Correction... it isn't 2.6.18 that fixes the problem.  It is compiling
> without LOCKDEP or PROVE_LOCKING.  I remove those and suddenly a
> 3 drive raid5 is faster than a single drive rather than much slower.
>
> Bill:  Do you have LOCKDEP or PROVE_LOCKING enabled in your .config ??

YES and NO respectively. I did try increasing the stripe_cache_size and 
got better performance, but nowhere near the maximum; perhaps the 
PROVE_LOCKING is still at fault, although performance of RAID-0 is as 
expected, so I'm dubious. In any case, by pushing the size from 256 to 
1024, 4096, and finally 10240 I was able to raise the speed to 82MB/s, 
which is right at the edge of what I need. I want to read the doc on 
stripe_cache_size before going huge; if that unit is K, 10MB is a LOT of 
cache when 256 works perfectly in RAID-0.

I noted that the performance really was bad using 2k writes before 
increasing the stripe_cache; I will repeat that after doing some other 
"real work" things.

Any additional input appreciated. I would expect the speed to be (Ndisk 
- 1)*SingleDiskSpeed without a huge buffer, so the fact that it isn't 
makes me suspect there's unintended serialization or buffering, even 
when not needed (and NOT wanted).

Thanks for the feedback, I'm updating the files as I type.
http://www.tmr.com/~davidsen/RAID_speed
http://www.tmr.com/~davidsen/FC6-config

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979



* Re: Odd (slow) RAID performance
  2006-12-12 17:44                   ` Bill Davidsen
@ 2006-12-12 18:48                     ` Raz Ben-Jehuda(caro)
  2006-12-12 21:51                       ` Bill Davidsen
  0 siblings, 1 reply; 20+ messages in thread
From: Raz Ben-Jehuda(caro) @ 2006-12-12 18:48 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: Roger Lucas, linux-raid

On 12/12/06, Bill Davidsen <davidsen@tmr.com> wrote:
> Neil Brown wrote:
> > On Friday December 8, neilb@suse.de wrote:
> >
> >> I have measured very slow write throughput for raid5 as well, though
> >> 2.6.18 does seem to have the same problem.  I'll double check and do a
> >> git bisect and see what I can come up with.
> >>
> >
> > Correction... it isn't 2.6.18 that fixes the problem.  It is compiling
> > without LOCKDEP or PROVE_LOCKING.  I remove those and suddenly a
> > 3 drive raid5 is faster than a single drive rather than much slower.
> >
> > Bill:  Do you have LOCKDEP or PROVE_LOCKING enabled in your .config ??
>
> YES and NO respectively. I did try increasing the stripe_cache_size and
> got better but not anywhere near max performance, perhaps the
> PROVE_LOCKING is still at fault, although performance of RAID-0 is as
> expected, so I'm dubious. In any case, by pushing the size from 256 to
> 1024, 4096, and finally 10240 I was able to raise the speed to 82MB/s,
> which is right at the edge of what I need. I want to read the doc on
> stripe_cache_size before going huge, if that's K 10MB is a LOT of cache
> when 256 works perfectly in RAID-0.
>
> I noted that the performance really was bad using 2k write, before
> increasing the stripe_cache, I will repeat that after doing some other
> "real work" things.
>
> Any additional input appreciated, I would expect the speed to be (Ndisk
> - 1)*SingleDiskSpeed without a huge buffer, so the fact that it isn't
> makes me suspect there's unintended serialization or buffering, even
> when not need (and NOT wanted).
>
> Thanks for the feedback, I'm updating the files as I type.
> http://www.tmr.com/~davidsen/RAID_speed
> http://www.tmr.com/~davidsen/FC6-config
>
> --
> bill davidsen <davidsen@tmr.com>
>  CTO TMR Associates, Inc
>  Doing interesting things with small computers since 1979
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

Bill hello
I have been working on raid5 write throughput.
The whole idea is the access pattern.
One should use buffers sized with respect to the stripe size;
this way you will be able to eliminate the undesired reads.
By accessing it correctly I have managed to reach a write
throughput that scales with the number of disks in the raid.
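
(A minimal sketch of what I mean, assuming a 3-disk array with 256k
chunks, i.e. 512k of data per stripe; the oflag is there only to keep
the page cache from hiding the alignment:

	dd if=/dev/zero of=/dev/md0 bs=512k count=4096 oflag=direct

writes that are aligned to, and a multiple of, the data-per-stripe size
never need to read back old data or parity.)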


-- 
Raz


* Re: Odd (slow) RAID performance
  2006-12-12 18:48                     ` Raz Ben-Jehuda(caro)
@ 2006-12-12 21:51                       ` Bill Davidsen
  2006-12-13 17:44                         ` Mark Hahn
  0 siblings, 1 reply; 20+ messages in thread
From: Bill Davidsen @ 2006-12-12 21:51 UTC (permalink / raw)
  To: Raz Ben-Jehuda(caro); +Cc: Roger Lucas, linux-raid

Raz Ben-Jehuda(caro) wrote:
> On 12/12/06, Bill Davidsen <davidsen@tmr.com> wrote:
>> Neil Brown wrote:
>> > On Friday December 8, neilb@suse.de wrote:
>> >
>> >> I have measured very slow write throughput for raid5 as well, though
>> >> 2.6.18 does seem to have the same problem.  I'll double check and 
>> do a
>> >> git bisect and see what I can come up with.
>> >>
>> >
>> > Correction... it isn't 2.6.18 that fixes the problem.  It is compiling
>> > without LOCKDEP or PROVE_LOCKING.  I remove those and suddenly a
>> > 3 drive raid5 is faster than a single drive rather than much slower.
>> >
>> > Bill:  Do you have LOCKDEP or PROVE_LOCKING enabled in your .config ??
>>
>> YES and NO respectively. I did try increasing the stripe_cache_size and
>> got better but not anywhere near max performance, perhaps the
>> PROVE_LOCKING is still at fault, although performance of RAID-0 is as
>> expected, so I'm dubious. In any case, by pushing the size from 256 to
>> 1024, 4096, and finally 10240 I was able to raise the speed to 82MB/s,
>> which is right at the edge of what I need. I want to read the doc on
>> stripe_cache_size before going huge, if that's K 10MB is a LOT of cache
>> when 256 works perfectly in RAID-0.
>>
>> I noted that the performance really was bad using 2k write, before
>> increasing the stripe_cache, I will repeat that after doing some other
>> "real work" things.
>>
>> Any additional input appreciated, I would expect the speed to be (Ndisk
>> - 1)*SingleDiskSpeed without a huge buffer, so the fact that it isn't
>> makes me suspect there's unintended serialization or buffering, even
>> when not need (and NOT wanted).
>>
>> Thanks for the feedback, I'm updating the files as I type.
>> http://www.tmr.com/~davidsen/RAID_speed
>> http://www.tmr.com/~davidsen/FC6-config
>>
>> -- 
>> bill davidsen <davidsen@tmr.com>
>>  CTO TMR Associates, Inc
>>  Doing interesting things with small computers since 1979
>
> Bill helllo
> I have been working on raid5 performance write throughout.
> The whole idea is the access pattern.
> One should  buffers with respect to the size of stripe.
> this way you will be able to eiliminate the undesired reads.
> By accessing it correctly I have managed reach a write
> throughout with respect to the number of disks in the raid.
>
>
I'm doing the tests writing 2GB of data to the raw array, in 1MB writes. 
The array is RAID-5 with a 256k chunk size. I wouldn't really expect any 
reads, unless I totally misunderstand how all those numbers work 
together. I was really trying to avoid any issues there. However, the 
only other size I have tried was 2K blocks, so I can try other sizes. I 
have a hard time picturing why smaller sizes would be better, but that's 
what testing is for.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979



* Re: Odd (slow) RAID performance
  2006-12-12 21:51                       ` Bill Davidsen
@ 2006-12-13 17:44                         ` Mark Hahn
  2006-12-20  4:05                           ` Bill Davidsen
  0 siblings, 1 reply; 20+ messages in thread
From: Mark Hahn @ 2006-12-13 17:44 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: linux-raid

>>> which is right at the edge of what I need. I want to read the doc on
>>> stripe_cache_size before going huge, if that's K 10MB is a LOT of cache
>>> when 256 works perfectly in RAID-0.

but they are basically unrelated.  in r5/6, the stripe cache is absolutely
critical in caching parity chunks.  in r0 it never functions this way, though
it may help some workloads a bit (IOs which aren't naturally aligned to 
the underlying disk layout.)
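
(a rough sizing note, and this is my understanding rather than gospel:
each stripe cache entry holds one page per member device, so on a 3-disk
array 10240 entries is roughly 10240 x 4K x 3 = ~120MB of cache, versus
~3MB for the default 256.)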

>>> Any additional input appreciated, I would expect the speed to be (Ndisk
>>> - 1)*SingleDiskSpeed without a huge buffer, so the fact that it isn't

as others have reported, you can actually approach that with "naturally"
aligned and sized writes.

> I'm doing the tests writing 2GB of data to the raw array, in 1MB writes. The 
> array is RAID-5 with 256 chunk size. I wouldn't really expect any reads,

but how many disks?  if your 1M writes are to 4 data disks, you 
stand a chance of streaming (assuming your writes are naturally 
aligned, or else you'll be somewhat dependent on the stripe cache.)
in other words, your whole-stripe size is ndisks*chunksize, and for 
256K chunks and, say, 14 disks, that's pretty monstrous...
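
for the 3-drive, 256K-chunk array in question that's 3 x 256K = 768K per
full stripe, of which 512K is data, so an aligned 1M write spans two
complete data stripes.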

I think that's a factor often overlooked - large chunk sizes, especially
with r5/6 AND lots of disks, mean you probably won't ever do "blind" 
updates, and thus need the r/m/w cycle.  in that case, if the stripe cache
is not big/smart enough, you'll be limited by reads.

I'd like to experiment with this, to see how much benefit you 
really get from using larger chunk sizes.  I'm guessing that past 32K
or so, normal *ata systems don't speed up much.  fabrics with higher 
latency or command/arbitration overhead would want larger chunks.

> tried was 2K blocks, so I can try other sizes. I have a hard time picturing 
> why smaller sizes would be better, but that's what testing is for.

larger writes (from user-space) generally help, probably up to MB's.
smaller chunks help by making it more likely to do blind parity updates;
a larger stripe cache can help that too.

I think I recall an earlier thread regarding how the stripe cache is used
somewhat naively - that all IO goes through it.  the most important 
blocks would be parity and "ends" of a write that partially update an 
underlying chunk.  (conversely, don't bother caching anything which 
can be blindly written to disk.)

regards, mark hahn.


* Re: Odd (slow) RAID performance
  2006-12-13 17:44                         ` Mark Hahn
@ 2006-12-20  4:05                           ` Bill Davidsen
  0 siblings, 0 replies; 20+ messages in thread
From: Bill Davidsen @ 2006-12-20  4:05 UTC (permalink / raw)
  To: Mark Hahn; +Cc: linux-raid

Mark Hahn wrote:
>>>> which is right at the edge of what I need. I want to read the doc on
>>>> stripe_cache_size before going huge, if that's K 10MB is a LOT of 
>>>> cache
>>>> when 256 works perfectly in RAID-0.
>
> but they are basically unrelated.  in r5/6, the stripe cache is 
> absolutely
> critical in caching parity chunks.  in r0, never functions this way, 
> though
> it may help some workloads a bit (IOs which aren't naturally aligned 
> to the underlying disk layout.)
>
>>>> Any additional input appreciated, I would expect the speed to be 
>>>> (Ndisk
>>>> - 1)*SingleDiskSpeed without a huge buffer, so the fact that it isn't
>
> as others have reported, you can actually approach that with "naturally"
> aligned and sized writes.
I don't know what would be natural; I have three drives, a 256k chunk 
size, and was originally testing with 1MB writes. I have a hard time 
seeing a case where there would be a need to read-alter-rewrite: each 
chunk should be writable as data1, data2, and parity, without readback. 
I was writing directly to the array, so the data should start on a chunk 
boundary. Until I went very large on stripe_cache_size, performance was 
almost exactly 100% the write speed of a single drive. There is no 
obvious way to explain that other than writing one drive at a time. And 
shrinking the write size by factors of two resulted in decreasing 
performance, down to about 13% of the speed of a single drive. Such 
performance just isn't useful, and going to RAID-10 eliminated the 
problem, indicating that the RAID-5 implementation is the cause.
>
>> I'm doing the tests writing 2GB of data to the raw array, in 1MB 
>> writes. The array is RAID-5 with 256 chunk size. I wouldn't really 
>> expect any reads,
>
> but how many disks?  if your 1M writes are to 4 data disks, you stand 
> a chance of streaming (assuming your writes are naturally aligned, or 
> else you'll be somewhat dependent on the stripe cache.)
> in other words, your whole-stripe size is ndisks*chunksize, and for 
> 256K chunks and, say, 14 disks, that's pretty monstrous...
Three drives, so they could be totally isolated from other i/o.
>
> I think that's a factor often overlooked - large chunk sizes, especially
> with r5/6 AND lots of disks, mean you probably won't ever do "blind" 
> updates, and thus need the r/m/w cycle.  in that case, if the stripe 
> cache
> is not big/smart enough, you'll be limited by reads.
I didn't have lots of disks, and when the data and parity are all being 
updated in full chunk increments, there's no reason for a read, since 
the data won't be needed. I agree that it's probably being read, but 
needlessly.
>
> I'd like to experiment with this, to see how much benefit you really 
> get from using larger chunk sizes.  I'm guessing that past 32K
> or so, normal *ata systems don't speedup much.  fabrics with higher 
> latency or command/arbitration overhead would want larger chunks.
>
>> tried was 2K blocks, so I can try other sizes. I have a hard time 
>> picturing why smaller sizes would be better, but that's what testing 
>> is for.
>
> larger writes (from user-space) generally help, probably up to MB's.
> smaller chunks help by making it more likley to do blind parity updates;
> a larger stripe cache can help that too.
I tried 256B to 1MB sizes; 1MB was best, or more correctly, the least 
unacceptable.
>
> I think I recall an earlier thread regarding how the stripe cache is used
> somewhat naively - that all IO goes through it.  the most important 
> blocks would be parity and "ends" of a write that partially update an 
> underlying chunk.  (conversely, don't bother caching anything which 
> can be blindly written to disk.) 
I fear that last parenthetical isn't being observed.

If it weren't for RAID-1 and RAID-10 being fast I wouldn't complain 
about RAID-5.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979


