* Odd (slow) RAID performance @ 2006-11-30 14:13 Bill Davidsen 2006-11-30 14:31 ` Roger Lucas 0 siblings, 1 reply; 20+ messages in thread
From: Bill Davidsen @ 2006-11-30 14:13 UTC (permalink / raw)
To: linux-raid

Pardon if you see this twice, I sent it last night and it never showed up...

I was seeing some bad disk performance on a new install of Fedora Core 6, so I did some measurements of write speed, and it would appear that write performance is so slow it can't write my data as fast as it is generated :-(

The method: I wrote 2GB of data to various configurations with
  sync; time bash -c "dd if=/dev/zero bs=1024k count=2048 of=XXXXX; sync"
where XXXXX was a raw partition, raw RAID device, or ext2 filesystem over a RAID device. I recorded the time reported by dd, which doesn't include a final sync, and the total time from start of write to end of sync, which I believe represents the true effective performance. All tests were run on a dedicated system, with the RAID devices or filesystem freshly created.

For a baseline, I wrote to a single drive, single raw partition, which gave about 50MB/s transfer. Then I created a RAID-0 device, striped over three test drives. As expected this gave a speed of about 147 MB/s. Then I created an ext2 filesystem over that device, and the test showed 139 MB/s. This was as expected.

Then I stopped and deleted the RAID-0 and built a RAID-5 on the same partitions. A write to this raw RAID device showed only 37.5 MB/s!! Putting an ext2 f/s over that device dropped the speed to 35 MB/s. Since I am trying to write bursts at 60MB/s, this is a serious problem for me.

Then I created a new RAID-10 array on the same partitions. This showed a write speed of 75.8 MB/s, double the RAID-5 speed even though I was (presumably) writing twice the data. An ext2 f/s on that array showed 74 MB/s write speed. I didn't use /proc/diskstats to gather actual counts, nor do I know if they show actual transfer data below all the levels of o/s magic, but that sounds as if RAID-5 is not working right. I don't have enough space to use RAID-10 for incoming data, so that's not an option.

Then I thought that perhaps my chunk size, defaulted to 64k, was too small. So I created an array with a 256k chunk size. That showed about 36 MB/s to the raw array, and 32.4 MB/s to an ext2 f/s using the array.

Finally I decided to create a new f/s using the "stride=" option, and see if that would work better. I had 256k chunks, two data and a parity per stripe, so I used the data size, 512k, for the calculation. The man page says to use the f/s block size, 4k in this case, for the calculation, so 512/4 gave a stride of 128, and I used that. The increase was below the noise, about 50KB/s faster.

Any thoughts on this gratefully accepted; I may try the motherboard RAID if I can't make this work. This also explains why my swapping is so slow. Swap I can switch to RAID-1; it's used mainly for test, big data sets and suspend. If I can't make this fast I'd like to understand why it's slow.

I did make the raw results available if people want to see more info: http://www.tmr.com/~davidsen/RAID_speed.html

-- Bill Davidsen <davidsen@tmr.com> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot

^ permalink raw reply [flat|nested] 20+ messages in thread
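For anyone who wants to reproduce the measurement, the test above boils down to a small loop like the sketch below; the device names are placeholders and the test is destructive, so it should only be pointed at scratch partitions or arrays. The stride note is based on the mke2fs convention that stride is the chunk size expressed in filesystem blocks, which for a 256k chunk and 4k blocks would suggest 64 rather than the 128 used above -- worth double-checking against the local man page.

  # Sketch of the write test used above (destructive; devices are placeholders).
  for dev in /dev/sdb1 /dev/md0; do
      echo "=== $dev ==="
      sync
      time bash -c "dd if=/dev/zero bs=1024k count=2048 of=$dev; sync"
  done

  # ext2 over a 256k-chunk array with 4k blocks, taking stride = chunk/block:
  mke2fs -b 4096 -E stride=64 /dev/md0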
* RE: Odd (slow) RAID performance 2006-11-30 14:13 Odd (slow) RAID performance Bill Davidsen @ 2006-11-30 14:31 ` Roger Lucas 2006-11-30 15:30 ` Bill Davidsen 0 siblings, 1 reply; 20+ messages in thread From: Roger Lucas @ 2006-11-30 14:31 UTC (permalink / raw) To: 'Bill Davidsen', linux-raid > -----Original Message----- > From: linux-raid-owner@vger.kernel.org [mailto:linux-raid- > owner@vger.kernel.org] On Behalf Of Bill Davidsen > Sent: 30 November 2006 14:13 > To: linux-raid@vger.kernel.org > Subject: Odd (slow) RAID performance > > Pardon if you see this twice, I sent it last night and it never showed > up... > > I was seeing some bad disk performance on a new install of Fedora Core > 6, so I did some measurements of write speed, and it would appear that > write performance is so slow it can't write my data as fast as it is > generated :-( What drive configuration are you using (SCSI / ATA / SATA), what chipset is providing the disk interface and what cpu are you running with? Thanks, RL ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Odd (slow) RAID performance 2006-11-30 14:31 ` Roger Lucas @ 2006-11-30 15:30 ` Bill Davidsen 2006-11-30 15:32 ` Roger Lucas 0 siblings, 1 reply; 20+ messages in thread
From: Bill Davidsen @ 2006-11-30 15:30 UTC (permalink / raw)
To: Roger Lucas; +Cc: linux-raid

Roger Lucas wrote:
>> -----Original Message-----
>> From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
>> owner@vger.kernel.org] On Behalf Of Bill Davidsen
>> Sent: 30 November 2006 14:13
>> To: linux-raid@vger.kernel.org
>> Subject: Odd (slow) RAID performance
>>
>> Pardon if you see this twice, I sent it last night and it never showed
>> up...
>>
>> I was seeing some bad disk performance on a new install of Fedora Core
>> 6, so I did some measurements of write speed, and it would appear that
>> write performance is so slow it can't write my data as fast as it is
>> generated :-(
>>
>
> What drive configuration are you using (SCSI / ATA / SATA), what chipset is
> providing the disk interface and what cpu are you running with?

3xSATA, Seagate 320 ST3320620AS, Intel 6600, ICH7 controller using the ata-piix driver, with drive cache set to write-back. It's not obvious to me why that matters, but if it helps you see the problem I'm glad to provide the info. I'm seeing ~50MB/s on the raw drive, and 3x that on plain stripes, so I'm assuming that either the RAID-5 code is not working well or I haven't set it up optimally.

-- bill davidsen <davidsen@tmr.com> CTO TMR Associates, Inc Doing interesting things with small computers since 1979

^ permalink raw reply [flat|nested] 20+ messages in thread
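As an aside, the per-drive write-back cache setting mentioned here can be inspected (and toggled) with hdparm; a minimal sketch, assuming the three members are sda through sdc:

  for d in /dev/sda /dev/sdb /dev/sdc; do
      hdparm -W $d       # report the current write-caching state
      # hdparm -W1 $d    # enable write-back caching (-W0 disables it)
  done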
* RE: Odd (slow) RAID performance 2006-11-30 15:30 ` Bill Davidsen @ 2006-11-30 15:32 ` Roger Lucas 2006-11-30 21:09 ` Bill Davidsen 0 siblings, 1 reply; 20+ messages in thread From: Roger Lucas @ 2006-11-30 15:32 UTC (permalink / raw) To: 'Bill Davidsen'; +Cc: linux-raid > > What drive configuration are you using (SCSI / ATA / SATA), what chipset > is > > providing the disk interface and what cpu are you running with? > 3xSATA, Seagate 320 ST3320620AS, Intel 6600, ICH7 controller using the > ata-piix driver, with drive cache set to write-back. It's not obvious to > me why that matters, but if it helps you see the problem I''m glad to > provide the info. I'm seeing ~50MB/s on the raw drive, and 3x that on > plain stripes, so I'm assuming that either the RAID-5 code is not > working well or I haven't set it up optimally. If it had been ATA, and you had two drives as master+slave on the same cable, then they would be fast individually but slow as a pair. RAID-5 is higher overhead than RAID-0/RAID-1 so if your CPU was slow then you would see some degradation from that too. We have similar hardware here so I'll run some tests here and see what I get... ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Odd (slow) RAID performance 2006-11-30 15:32 ` Roger Lucas @ 2006-11-30 21:09 ` Bill Davidsen 2006-12-01 9:24 ` Roger Lucas 0 siblings, 1 reply; 20+ messages in thread
From: Bill Davidsen @ 2006-11-30 21:09 UTC (permalink / raw)
To: Roger Lucas; +Cc: linux-raid

Roger Lucas wrote:
>>> What drive configuration are you using (SCSI / ATA / SATA), what chipset
>>>
>> is
>>
>>> providing the disk interface and what cpu are you running with?
>>>
>> 3xSATA, Seagate 320 ST3320620AS, Intel 6600, ICH7 controller using the
>> ata-piix driver, with drive cache set to write-back. It's not obvious to
>> me why that matters, but if it helps you see the problem I''m glad to
>> provide the info. I'm seeing ~50MB/s on the raw drive, and 3x that on
>> plain stripes, so I'm assuming that either the RAID-5 code is not
>> working well or I haven't set it up optimally.
>>
>
> If it had been ATA, and you had two drives as master+slave on the same
> cable, then they would be fast individually but slow as a pair.
>
> RAID-5 is higher overhead than RAID-0/RAID-1 so if your CPU was slow then
> you would see some degradation from that too.
>
> We have similar hardware here so I'll run some tests here and see what I
> get...

Much appreciated. Since my last note I tried adding --bitmap=internal to the array. Boy, is that a write performance killer. I will have the chart updated in a minute, but write dropped to ~15MB/s with the bitmap. Since Fedora can't seem to shut the last array down cleanly, I get a rebuild on every boot :-( So the array for the LVM has the bitmap on, as I hate to rebuild 1.5TB regularly. Have to do some compromises on that!

Thanks for looking!

-- bill davidsen <davidsen@tmr.com> CTO TMR Associates, Inc Doing interesting things with small computers since 1979

^ permalink raw reply [flat|nested] 20+ messages in thread
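One possible way to soften the internal-bitmap write penalty, offered only as an untested suggestion, is to recreate the bitmap with a much larger bitmap chunk so it is dirtied less often. The chunk value below is just an example, and this assumes the local mdadm accepts --bitmap-chunk together with --grow:

  mdadm --grow /dev/md0 --bitmap=none                           # drop the existing bitmap
  mdadm --grow /dev/md0 --bitmap=internal --bitmap-chunk=65536  # re-add with ~64MB chunks (example value)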
* RE: Odd (slow) RAID performance 2006-11-30 21:09 ` Bill Davidsen @ 2006-12-01 9:24 ` Roger Lucas 2006-12-02 5:27 ` Bill Davidsen 0 siblings, 1 reply; 20+ messages in thread From: Roger Lucas @ 2006-12-01 9:24 UTC (permalink / raw) To: 'Bill Davidsen'; +Cc: linux-raid > Roger Lucas wrote: > >>> What drive configuration are you using (SCSI / ATA / SATA), what > chipset > >>> > >> is > >> > >>> providing the disk interface and what cpu are you running with? > >>> > >> 3xSATA, Seagate 320 ST3320620AS, Intel 6600, ICH7 controller using the > >> ata-piix driver, with drive cache set to write-back. It's not obvious > to > >> me why that matters, but if it helps you see the problem I''m glad to > >> provide the info. I'm seeing ~50MB/s on the raw drive, and 3x that on > >> plain stripes, so I'm assuming that either the RAID-5 code is not > >> working well or I haven't set it up optimally. > >> > > > > If it had been ATA, and you had two drives as master+slave on the same > > cable, then they would be fast individually but slow as a pair. > > > > RAID-5 is higher overhead than RAID-0/RAID-1 so if your CPU was slow > then > > you would see some degradation from that too. > > > > We have similar hardware here so I'll run some tests here and see what I > > get... > > Much appreciated. Since my last note I tried adding --bitmap=internal to > the array. Bot is that a write performance killer. I will have the chart > updated in a minute, but write dropped to ~15MB/s with bitmap. Since > Fedora can't seem to shut the last array down cleanly, I get a rebuild > on every boot :-( So the array for the LVM has bitmap on, as I hate to > rebuild 1.5TB regularly. Have to do some compromises on that! > Hi Bill, Here are the results of my tests here: CPU: Intel Celetron 2.7GHz socket 775 MB: Abit LG-81 (Lakeport ICH7 chipset) HDD: 4 x Seagate SATA ST3160812AS (directly connected to ICH7) OS: Linux 2.6.16-xen root@hydra:~# uname -a Linux hydra 2.6.16-xen #1 SMP Thu Apr 13 18:46:07 BST 2006 i686 GNU/Linux root@hydra:~# All four disks are built into a RAID-5 array to provide ~420GB real storage. Most of this is then used by the other Xen virtual machines but there is a bit of space left on this server to play with in the Dom-0. I wasn't able to run I/O tests with "dd" on the disks themselves as I don't have a spare partition to corrupt, but hdparm gives: root@hydra:~# hdparm -tT /dev/sda /dev/sda: Timing cached reads: 3296 MB in 2.00 seconds = 1648.48 MB/sec Timing buffered disk reads: 180 MB in 3.01 seconds = 59.78 MB/sec root@hydra:~# Which is exactly what I would expect as this is the performance limit of the disk. We have a lot of ICH7/ICH7R-based servers here and all can run the disk at their maximum physical speed without problems. 
root@hydra:~# cat /proc/mdstat Personalities : [raid5] [raid4] md0 : active raid5 sda2[0] sdd2[3] sdc2[2] sdb2[1] 468647808 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU] unused devices: <none> root@hydra:~# df -h Filesystem Size Used Avail Use% Mounted on /dev/mapper/bigraid-root 10G 1.3G 8.8G 13% / <snip> root@hydra:~# vgs VG #PV #LV #SN Attr VSize VFree bigraid 1 13 0 wz--n- 446.93G 11.31G root@hydra:~# lvcreate --name testspeed --size 2G bigraid Logical volume "testspeed" created root@hydra:~# *** Now for the LVM over RAID-5 read/write tests *** root@hydra:~# sync; time bash -c "dd if=/dev/zero bs=1024k count=2048 of=/dev/bigraid/testspeed; sync" 2048+0 records in 2048+0 records out 2147483648 bytes (2.1 GB) copied, 33.7345 seconds, 63.7 MB/s real 0m34.211s user 0m0.020s sys 0m2.970s root@hydra:~# sync; time bash -c "dd of=/dev/zero bs=1024k count=2048 if=/dev/bigraid/testspeed; sync" 2048+0 records in 2048+0 records out 2147483648 bytes (2.1 GB) copied, 38.1175 seconds, 56.3 MB/s real 0m38.637s user 0m0.010s sys 0m3.260s root@hydra:~# During the above two tests, the CPU showed about 35% idle using "top". *** Now for the file system read/write tests *** (Reiser over LVM over RAID-5) root@hydra:~# mount /dev/mapper/bigraid-root on / type reiserfs (rw) <snip> root@hydra:~# root@hydra:~# sync; time bash -c "dd if=/dev/zero bs=1024k count=2048 of=~/testspeed; sync" 2048+0 records in 2048+0 records out 2147483648 bytes (2.1 GB) copied, 29.8863 seconds, 71.9 MB/s real 0m32.289s user 0m0.000s sys 0m4.440s root@hydra:~# sync; time bash -c "dd of=/dev/null bs=1024k count=2048 if=~/testspeed; sync" 2048+0 records in 2048+0 records out 2147483648 bytes (2.1 GB) copied, 40.332 seconds, 53.2 MB/s real 0m40.973s user 0m0.010s sys 0m2.640s root@hydra:~# During the above two tests, the CPU showed between 0% and 30% idle using "top". Just for curiousity, I started the RAID-5 check process to see what load it generated... root@hydra:~# cat /sys/block/md0/md/mismatch_cnt 0 root@hydra:~# echo check > /sys/block/md0/md/sync_action root@hydra:~# cat /sys/block/md0/md/sync_action check root@hydra:~# cat /proc/mdstat Personalities : [raid5] [raid4] md0 : active raid5 sda2[0] sdd2[3] sdc2[2] sdb2[1] 468647808 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU] [>....................] resync = 1.0% (1671552/156215936) finish=101.8min speed=25292K/sec unused devices: <none> root@hydra:~# Whilst the above test was running, the CPU load was between 3% and 7%, so running the RAID array isn't that hard for it... ------------------------- So, using a 4-disk RAID-5 array with an ICH7, I get about 64M write and 54MB read prformance. The processor is about 35% idle whilst the test is running - I'm not sure why this is, I would have expected the processor load to be 0% idle as it should be hitting the hard disk as fast as possible and waiting for it otherwise.... If I run over Reiser, the processor load changes a lot more, varying between 0% and 35% idle. It also takes a couple of seconds after the test has finished before the load drops down to zero on the write test, so I suspect these results are basically the same as the raw LVM-over-RAID5 performance. Summary - it is a little faster with 4 disks rather than the 37.5 MB/s that you have with just the three, but it is WAY off the theoretical target of 3x60MB = 180MB that could be expected given that you are running a 4-disk RAID-5 array. 
On the flip side, the performance is good enough for me, so it is not causing me a problem, but it seems that there should be a performance boost available somewhere! Best regards, Roger ^ permalink raw reply [flat|nested] 20+ messages in thread
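A non-destructive way to get per-disk baseline numbers comparable to the array figures above, when no scratch partition is available, is a raw read test (read-only). The devices and sizes below are examples; adding iflag=direct, if the local dd supports it, keeps the page cache out of the picture:

  for d in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
      echo "=== $d ==="
      dd if=$d of=/dev/null bs=1024k count=2048
  done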
* Re: Odd (slow) RAID performance 2006-12-01 9:24 ` Roger Lucas @ 2006-12-02 5:27 ` Bill Davidsen 2006-12-05 1:33 ` Dan Williams 0 siblings, 1 reply; 20+ messages in thread From: Bill Davidsen @ 2006-12-02 5:27 UTC (permalink / raw) To: Roger Lucas; +Cc: linux-raid, neilb Roger Lucas wrote: >> Roger Lucas wrote: >>>>> What drive configuration are you using (SCSI / ATA / SATA), what >> chipset >>>> is >>>> >>>>> providing the disk interface and what cpu are you running with? >>>>> >>>> 3xSATA, Seagate 320 ST3320620AS, Intel 6600, ICH7 controller using the >>>> ata-piix driver, with drive cache set to write-back. It's not obvious >> to >>>> me why that matters, but if it helps you see the problem I''m glad to >>>> provide the info. I'm seeing ~50MB/s on the raw drive, and 3x that on >>>> plain stripes, so I'm assuming that either the RAID-5 code is not >>>> working well or I haven't set it up optimally. >>>> >>> If it had been ATA, and you had two drives as master+slave on the same >>> cable, then they would be fast individually but slow as a pair. >>> >>> RAID-5 is higher overhead than RAID-0/RAID-1 so if your CPU was slow >> then >>> you would see some degradation from that too. >>> >>> We have similar hardware here so I'll run some tests here and see what I >>> get... >> Much appreciated. Since my last note I tried adding --bitmap=internal to >> the array. Bot is that a write performance killer. I will have the chart >> updated in a minute, but write dropped to ~15MB/s with bitmap. Since >> Fedora can't seem to shut the last array down cleanly, I get a rebuild >> on every boot :-( So the array for the LVM has bitmap on, as I hate to >> rebuild 1.5TB regularly. Have to do some compromises on that! >> > > Hi Bill, > > Here are the results of my tests here: > > CPU: Intel Celetron 2.7GHz socket 775 > MB: Abit LG-81 (Lakeport ICH7 chipset) > HDD: 4 x Seagate SATA ST3160812AS (directly connected to ICH7) > OS: Linux 2.6.16-xen > > root@hydra:~# uname -a > Linux hydra 2.6.16-xen #1 SMP Thu Apr 13 18:46:07 BST 2006 i686 GNU/Linux > root@hydra:~# > > All four disks are built into a RAID-5 array to provide ~420GB real storage. > Most of this is then used by the other Xen virtual machines but there is a > bit of space left on this server to play with in the Dom-0. > > I wasn't able to run I/O tests with "dd" on the disks themselves as I don't > have a spare partition to corrupt, but hdparm gives: > > root@hydra:~# hdparm -tT /dev/sda > > /dev/sda: > Timing cached reads: 3296 MB in 2.00 seconds = 1648.48 MB/sec > Timing buffered disk reads: 180 MB in 3.01 seconds = 59.78 MB/sec > root@hydra:~# > > Which is exactly what I would expect as this is the performance limit of the > disk. We have a lot of ICH7/ICH7R-based servers here and all can run the > disk at their maximum physical speed without problems. 
> > root@hydra:~# cat /proc/mdstat > Personalities : [raid5] [raid4] > md0 : active raid5 sda2[0] sdd2[3] sdc2[2] sdb2[1] > 468647808 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU] > > unused devices: <none> > root@hydra:~# df -h > Filesystem Size Used Avail Use% Mounted on > /dev/mapper/bigraid-root > 10G 1.3G 8.8G 13% / > <snip> > root@hydra:~# vgs > VG #PV #LV #SN Attr VSize VFree > bigraid 1 13 0 wz--n- 446.93G 11.31G > root@hydra:~# lvcreate --name testspeed --size 2G bigraid > Logical volume "testspeed" created > root@hydra:~# > > *** Now for the LVM over RAID-5 read/write tests *** > > root@hydra:~# sync; time bash -c "dd if=/dev/zero bs=1024k count=2048 > of=/dev/bigraid/testspeed; sync" > 2048+0 records in > 2048+0 records out > 2147483648 bytes (2.1 GB) copied, 33.7345 seconds, 63.7 MB/s > > real 0m34.211s > user 0m0.020s > sys 0m2.970s > root@hydra:~# sync; time bash -c "dd of=/dev/zero bs=1024k count=2048 > if=/dev/bigraid/testspeed; sync" > 2048+0 records in > 2048+0 records out > 2147483648 bytes (2.1 GB) copied, 38.1175 seconds, 56.3 MB/s > > real 0m38.637s > user 0m0.010s > sys 0m3.260s > root@hydra:~# > > During the above two tests, the CPU showed about 35% idle using "top". > > *** Now for the file system read/write tests *** > (Reiser over LVM over RAID-5) > > root@hydra:~# mount > /dev/mapper/bigraid-root on / type reiserfs (rw) > <snip> > root@hydra:~# > > > root@hydra:~# sync; time bash -c "dd if=/dev/zero bs=1024k count=2048 > of=~/testspeed; sync" > 2048+0 records in > 2048+0 records out > 2147483648 bytes (2.1 GB) copied, 29.8863 seconds, 71.9 MB/s > > real 0m32.289s > user 0m0.000s > sys 0m4.440s > root@hydra:~# sync; time bash -c "dd of=/dev/null bs=1024k count=2048 > if=~/testspeed; sync" > 2048+0 records in > 2048+0 records out > 2147483648 bytes (2.1 GB) copied, 40.332 seconds, 53.2 MB/s > > real 0m40.973s > user 0m0.010s > sys 0m2.640s > root@hydra:~# > > During the above two tests, the CPU showed between 0% and 30% idle using > "top". > > Just for curiousity, I started the RAID-5 check process to see what load it > generated... > > root@hydra:~# cat /sys/block/md0/md/mismatch_cnt > 0 > root@hydra:~# echo check > /sys/block/md0/md/sync_action > root@hydra:~# cat /sys/block/md0/md/sync_action > check > root@hydra:~# cat /proc/mdstat > Personalities : [raid5] [raid4] > md0 : active raid5 sda2[0] sdd2[3] sdc2[2] sdb2[1] > 468647808 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU] > [>....................] resync = 1.0% (1671552/156215936) > finish=101.8min speed=25292K/sec > > unused devices: <none> > root@hydra:~# > > Whilst the above test was running, the CPU load was between 3% and 7%, so > running the RAID array isn't that hard for it... > > ------------------------- > > So, using a 4-disk RAID-5 array with an ICH7, I get about 64M write and 54MB > read prformance. The processor is about 35% idle whilst the test is running > - I'm not sure why this is, I would have expected the processor load to be > 0% idle as it should be hitting the hard disk as fast as possible and > waiting for it otherwise.... > > If I run over Reiser, the processor load changes a lot more, varying between > 0% and 35% idle. It also takes a couple of seconds after the test has > finished before the load drops down to zero on the write test, so I suspect > these results are basically the same as the raw LVM-over-RAID5 performance. 
>
> Summary - it is a little faster with 4 disks rather than the 37.5 MB/s that
> you have with just the three, but it is WAY off the theoretical target of
> 3x60MB = 180MB that could be expected given that you are running a 4-disk
> RAID-5 array.
>
> On the flip side, the performance is good enough for me, so it is not
> causing me a problem, but it seems that there should be a performance boost
> available somewhere!
>
> Best regards,
>
> Roger

Thank you so much for verifying this. I do keep enough room on my drives to run tests by creating whatever kind of array I need, but the point is clear: with N drives striped the transfer rate is N x the base rate of one drive; with RAID-5 it is about the speed of one drive, suggesting that the md code serializes writes.

If true, BOO, HISS!

Can you explain and educate us, Neil? This looks like terrible performance.

-- Bill Davidsen He was a full-time professional cat, not some moonlighting ferret or weasel. He knew about these things.

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Odd (slow) RAID performance 2006-12-02 5:27 ` Bill Davidsen @ 2006-12-05 1:33 ` Dan Williams 2006-12-07 15:51 ` Bill Davidsen 2006-12-08 6:01 ` Neil Brown 0 siblings, 2 replies; 20+ messages in thread From: Dan Williams @ 2006-12-05 1:33 UTC (permalink / raw) To: Bill Davidsen; +Cc: Roger Lucas, linux-raid, neilb On 12/1/06, Bill Davidsen <davidsen@tmr.com> wrote: > Thank you so much for verifying this. I do keep enough room on my drives > to run tests by creating any kind of whatever I need, but the point is > clear: with N drives striped the transfer rate is N x base rate of one > drive; with RAID-5 it is about the speed of one drive, suggesting that > the md code serializes writes. > > If true, BOO, HISS! > > Can you explain and educate us, Neal? This look like terrible performance. > Just curious what is your stripe_cache_size setting in sysfs? Neil, please include me in the education if what follows is incorrect: Read performance in kernels up to and including 2.6.19 is hindered by needing to go through the stripe cache. This situation should improve with the stripe-cache-bypass patches currently in -mm. As Raz reported in some cases the performance increase of this approach is 30% which is roughly equivalent to the performance difference I see of a 4-disk raid5 versus a 3-disk raid0. For the write case I can say that MD does not serialize writes. If by serialize you mean that there is 1:1 correlation between writes to the parity disk and writes to a data disk. To illustrate I instrumented MD to count how many times it issued a write to the parity disk and compared that to how many writes it performed to the member disks for the workload "dd if=/dev/zero of=/dev/md0 bs=1024k count=100". I recorded 8544 parity writes and 25600 member disk writes which is about 3 member disk writes per parity write, or pretty close to optimal for a 4-disk array. So, serialization is not the cause, performing sub-stripe width writes is not the cause as >98% of the writes happened without needing to read old data from the disks. However, I see the same performance on my system, about equal to a single disk. Here is where I step into supposition territory. Perhaps the discrepancy is related to the size of the requests going to the block layer. raid5 always makes page sized requests with the expectation that they will coalesce into larger requests in the block layer. Maybe we are missing coalescing opportunities in raid5 compared to what happens in the raid0 case? Are there any io scheduler knobs to turn along these lines? Dan ^ permalink raw reply [flat|nested] 20+ messages in thread
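For readers following along, the setting Dan asks about lives in sysfs and is counted in stripe-cache entries, not kilobytes; a quick sketch, with md0 and an example value:

  cat /sys/block/md0/md/stripe_cache_size          # 256 is the default seen in this thread
  echo 4096 > /sys/block/md0/md/stripe_cache_size  # example value; larger values use more memory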
* Re: Odd (slow) RAID performance 2006-12-05 1:33 ` Dan Williams @ 2006-12-07 15:51 ` Bill Davidsen 2006-12-08 1:15 ` Corey Hickey 2006-12-08 8:21 ` Gabor Gombas 2006-12-08 6:01 ` Neil Brown 1 sibling, 2 replies; 20+ messages in thread From: Bill Davidsen @ 2006-12-07 15:51 UTC (permalink / raw) To: Dan Williams; +Cc: Roger Lucas, linux-raid, neilb Dan Williams wrote: > On 12/1/06, Bill Davidsen <davidsen@tmr.com> wrote: >> Thank you so much for verifying this. I do keep enough room on my drives >> to run tests by creating any kind of whatever I need, but the point is >> clear: with N drives striped the transfer rate is N x base rate of one >> drive; with RAID-5 it is about the speed of one drive, suggesting that >> the md code serializes writes. >> >> If true, BOO, HISS! >> >> Can you explain and educate us, Neal? This look like terrible >> performance. >> > Just curious what is your stripe_cache_size setting in sysfs? > > Neil, please include me in the education if what follows is incorrect: > > Read performance in kernels up to and including 2.6.19 is hindered by > needing to go through the stripe cache. This situation should improve > with the stripe-cache-bypass patches currently in -mm. As Raz > reported in some cases the performance increase of this approach is > 30% which is roughly equivalent to the performance difference I see of > a 4-disk raid5 versus a 3-disk raid0. > > For the write case I can say that MD does not serialize writes. If by > serialize you mean that there is 1:1 correlation between writes to the > parity disk and writes to a data disk. To illustrate I instrumented > MD to count how many times it issued a write to the parity disk and > compared that to how many writes it performed to the member disks for > the workload "dd if=/dev/zero of=/dev/md0 bs=1024k count=100". I > recorded 8544 parity writes and 25600 member disk writes which is > about 3 member disk writes per parity write, or pretty close to > optimal for a 4-disk array. So, serialization is not the cause, > performing sub-stripe width writes is not the cause as >98% of the > writes happened without needing to read old data from the disks. > However, I see the same performance on my system, about equal to a > single disk. But the number of writes isn't an indication of serialization. If I write disk A, then B, then C, then D, you can't tell if I waited for each write to finish before starting the next, or did them in parallel. And since the write speed is equal to the speed of a single drive, effectively that's what happens, even though I can't see it in the code. I also suspect that write are not being combined, since writing the 2GB test runs at one-drive speed writing 1MB blocks, but floppy speed writing 2k blocks. And no, I'm not running out of CPU to do the overhead, it jumps from 2-4% to 30% of one CPU, but on an unloaded SMP system it's not CPU bound. > > Here is where I step into supposition territory. Perhaps the > discrepancy is related to the size of the requests going to the block > layer. raid5 always makes page sized requests with the expectation > that they will coalesce into larger requests in the block layer. > Maybe we are missing coalescing opportunities in raid5 compared to > what happens in the raid0 case? Are there any io scheduler knobs to > turn along these lines? Good thought, I had already tried that but not reported it, changing schedulers make no significant difference. In the range of 2-3%, which is close to the measurement jitter due to head position or whatever. 
I changed my swap to RAID-10, but RAID-5 just can't keep up with 70-100MB/s data bursts which I need. I'm probably going to scrap software RAID and go back to a controller, the write speeds are simply not even close to what they should be. I have one more thing to try, a tool I wrote to chase another problem a few years ago. I'll report if I find something. -- bill davidsen <davidsen@tmr.com> CTO TMR Associates, Inc Doing interesting things with small computers since 1979 ^ permalink raw reply [flat|nested] 20+ messages in thread
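To put numbers on the block-size effect described above, a sweep over write sizes against a scratch array can be run with something like the following (destructive; /dev/md0 is a placeholder):

  # Write 2GB at several block sizes and time each run including the final sync.
  for bs_kb in 2 64 256 1024; do
      count=$(( 2 * 1024 * 1024 / bs_kb ))
      sync
      echo "=== bs=${bs_kb}k ==="
      time bash -c "dd if=/dev/zero bs=${bs_kb}k count=$count of=/dev/md0; sync"
  done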
* Re: Odd (slow) RAID performance 2006-12-07 15:51 ` Bill Davidsen @ 2006-12-08 1:15 ` Corey Hickey 2006-12-08 8:21 ` Gabor Gombas 1 sibling, 0 replies; 20+ messages in thread From: Corey Hickey @ 2006-12-08 1:15 UTC (permalink / raw) To: linux-raid Bill Davidsen wrote: > Dan Williams wrote: >> On 12/1/06, Bill Davidsen <davidsen@tmr.com> wrote: >>> Thank you so much for verifying this. I do keep enough room on my drives >>> to run tests by creating any kind of whatever I need, but the point is >>> clear: with N drives striped the transfer rate is N x base rate of one >>> drive; with RAID-5 it is about the speed of one drive, suggesting that >>> the md code serializes writes. >>> >>> If true, BOO, HISS! >>> >>> Can you explain and educate us, Neal? This look like terrible >>> performance. >>> >> Just curious what is your stripe_cache_size setting in sysfs? >> >> Neil, please include me in the education if what follows is incorrect: >> >> Read performance in kernels up to and including 2.6.19 is hindered by >> needing to go through the stripe cache. This situation should improve >> with the stripe-cache-bypass patches currently in -mm. As Raz >> reported in some cases the performance increase of this approach is >> 30% which is roughly equivalent to the performance difference I see of >> a 4-disk raid5 versus a 3-disk raid0. >> >> For the write case I can say that MD does not serialize writes. If by >> serialize you mean that there is 1:1 correlation between writes to the >> parity disk and writes to a data disk. To illustrate I instrumented >> MD to count how many times it issued a write to the parity disk and >> compared that to how many writes it performed to the member disks for >> the workload "dd if=/dev/zero of=/dev/md0 bs=1024k count=100". I >> recorded 8544 parity writes and 25600 member disk writes which is >> about 3 member disk writes per parity write, or pretty close to >> optimal for a 4-disk array. So, serialization is not the cause, >> performing sub-stripe width writes is not the cause as >98% of the >> writes happened without needing to read old data from the disks. >> However, I see the same performance on my system, about equal to a >> single disk. > > But the number of writes isn't an indication of serialization. If I > write disk A, then B, then C, then D, you can't tell if I waited for > each write to finish before starting the next, or did them in parallel. > And since the write speed is equal to the speed of a single drive, > effectively that's what happens, even though I can't see it in the code. For what it's worth, my read and write speeds on a 5-disk RAID-5 are somewhat faster than the speed of any single drive. The array is a mixture of two different SATA drives and one IDE drive. Sustained individual read performances range from 56 MB/sec for the IDE drive to 68 MB/sec for the faster SATA drives. I can read from the RAID-5 at about 100MB/sec. I can't give precise numbers for write speeds, except to say that I can write to a file on the filesystem (which is mostly full and probably somewhat fragmented) at about 83 MB/sec. None of those numbers are equal to the theoretical maximum performance, so I see your point, but they're still faster than one individual disk. > I also suspect that write are not being combined, since writing the 2GB > test runs at one-drive speed writing 1MB blocks, but floppy speed > writing 2k blocks. 
And no, I'm not running out of CPU to do the > overhead, it jumps from 2-4% to 30% of one CPU, but on an unloaded SMP > system it's not CPU bound. >> >> Here is where I step into supposition territory. Perhaps the >> discrepancy is related to the size of the requests going to the block >> layer. raid5 always makes page sized requests with the expectation >> that they will coalesce into larger requests in the block layer. >> Maybe we are missing coalescing opportunities in raid5 compared to >> what happens in the raid0 case? Are there any io scheduler knobs to >> turn along these lines? > > Good thought, I had already tried that but not reported it, changing > schedulers make no significant difference. In the range of 2-3%, which > is close to the measurement jitter due to head position or whatever. > > I changed my swap to RAID-10, but RAID-5 just can't keep up with > 70-100MB/s data bursts which I need. I'm probably going to scrap > software RAID and go back to a controller, the write speeds are simply > not even close to what they should be. I have one more thing to try, a > tool I wrote to chase another problem a few years ago. I'll report if I > find something. I have read that using RAID to stripe swap space is ill-advised, or at least unnecessary. The kernel will stripe multiple swap devices if you assign them the same priority. http://tldp.org/HOWTO/Software-RAID-HOWTO-2.html If you've been using RAID-10 for swap, then I think you could just assign multiple RAID-1 devices the same swap priority for the same effect with (perhaps) less overhead. -Corey ^ permalink raw reply [flat|nested] 20+ messages in thread
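For reference, the equal-priority swap striping Corey describes looks roughly like the sketch below; the md devices are placeholders for two RAID-1 arrays, and the same effect can be had at runtime with swapon -p:

  # /etc/fstab entries (equal pri= values let the kernel stripe across them):
  #   /dev/md2   none   swap   sw,pri=1   0 0
  #   /dev/md3   none   swap   sw,pri=1   0 0
  # or at runtime:
  swapon -p 1 /dev/md2
  swapon -p 1 /dev/md3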
* Re: Odd (slow) RAID performance 2006-12-07 15:51 ` Bill Davidsen 2006-12-08 1:15 ` Corey Hickey @ 2006-12-08 8:21 ` Gabor Gombas 1 sibling, 0 replies; 20+ messages in thread From: Gabor Gombas @ 2006-12-08 8:21 UTC (permalink / raw) To: Bill Davidsen; +Cc: Dan Williams, Roger Lucas, linux-raid, neilb On Thu, Dec 07, 2006 at 10:51:25AM -0500, Bill Davidsen wrote: > I also suspect that write are not being combined, since writing the 2GB > test runs at one-drive speed writing 1MB blocks, but floppy speed > writing 2k blocks. And no, I'm not running out of CPU to do the > overhead, it jumps from 2-4% to 30% of one CPU, but on an unloaded SMP > system it's not CPU bound. You could use blktrace to see the actual requests that the md code sends down to the device, including request merging actions. That may provide more insight into what really happens. Gabor -- --------------------------------------------------------- MTA SZTAKI Computer and Automation Research Institute Hungarian Academy of Sciences --------------------------------------------------------- ^ permalink raw reply [flat|nested] 20+ messages in thread
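A minimal sketch of the kind of trace Gabor suggests, assuming blktrace/blkparse are installed, the kernel has block I/O tracing enabled, and debugfs is mounted; the traced member disk is a placeholder, and the dd test would run while the trace is capturing:

  mount -t debugfs none /sys/kernel/debug           # if not already mounted
  blktrace -d /dev/sda -o - | blkparse -i - > sda-trace.txt
  # ...run the dd write test in another shell, stop blktrace with Ctrl-C,
  # then look at request sizes and merge (M) events in sda-trace.txt.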
* Re: Odd (slow) RAID performance 2006-12-05 1:33 ` Dan Williams 2006-12-07 15:51 ` Bill Davidsen @ 2006-12-08 6:01 ` Neil Brown 2006-12-08 7:28 ` Neil Brown 2006-12-09 20:16 ` Bill Davidsen 1 sibling, 2 replies; 20+ messages in thread From: Neil Brown @ 2006-12-08 6:01 UTC (permalink / raw) To: Dan Williams; +Cc: Bill Davidsen, Roger Lucas, linux-raid On Monday December 4, dan.j.williams@gmail.com wrote: > > Here is where I step into supposition territory. Perhaps the > discrepancy is related to the size of the requests going to the block > layer. raid5 always makes page sized requests with the expectation > that they will coalesce into larger requests in the block layer. > Maybe we are missing coalescing opportunities in raid5 compared to > what happens in the raid0 case? Are there any io scheduler knobs to > turn along these lines? This can be measured. /proc/diskstats reports the number of requests as well as the number of sectors. The number of write requests is column 8. The number of write sectors is column 10. Comparing these you can get an average request size. I have found that the average request size is proportional to the size of the stripe cache (roughly, with limits) but increasing it doesn't increase through put. I have measured very slow write throughput for raid5 as well, though 2.6.18 does seem to have the same problem. I'll double check and do a git bisect and see what I can come up with. NeilBrown ^ permalink raw reply [flat|nested] 20+ messages in thread
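A throwaway awk line for turning those two columns into an average write-request size (column numbers as in /proc/diskstats, counting the major/minor/name fields; the device names are examples, and since the counters are cumulative you would sample before and after a test run and subtract):

  awk '$3 ~ /^(md0|sd[abc])$/ && $8 > 0 {
          printf "%-4s writes=%d  avg write size = %.1f KiB\n", $3, $8, $10 / $8 / 2
       }' /proc/diskstats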
* Re: Odd (slow) RAID performance 2006-12-08 6:01 ` Neil Brown @ 2006-12-08 7:28 ` Neil Brown 2006-12-09 20:20 ` Bill Davidsen 2006-12-12 17:44 ` Bill Davidsen 2006-12-09 20:16 ` Bill Davidsen 1 sibling, 2 replies; 20+ messages in thread From: Neil Brown @ 2006-12-08 7:28 UTC (permalink / raw) To: Dan Williams, Bill Davidsen, Roger Lucas, linux-raid On Friday December 8, neilb@suse.de wrote: > I have measured very slow write throughput for raid5 as well, though > 2.6.18 does seem to have the same problem. I'll double check and do a > git bisect and see what I can come up with. Correction... it isn't 2.6.18 that fixes the problem. It is compiling without LOCKDEP or PROVE_LOCKING. I remove those and suddenly a 3 drive raid5 is faster than a single drive rather than much slower. Bill: Do you have LOCKDEP or PROVE_LOCKING enabled in your .config ?? NeilBrown ^ permalink raw reply [flat|nested] 20+ messages in thread
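On a distribution kernel, Neil's question can usually be answered without rebuilding anything, assuming the config was installed under /boot (as Fedora does) or CONFIG_IKCONFIG_PROC is enabled:

  grep -E 'CONFIG_(LOCKDEP|PROVE_LOCKING)' /boot/config-$(uname -r)
  # or, if /proc/config.gz exists:
  zgrep -E 'CONFIG_(LOCKDEP|PROVE_LOCKING)' /proc/config.gz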
* Re: Odd (slow) RAID performance 2006-12-08 7:28 ` Neil Brown @ 2006-12-09 20:20 ` Bill Davidsen 2006-12-12 17:44 ` Bill Davidsen 1 sibling, 0 replies; 20+ messages in thread From: Bill Davidsen @ 2006-12-09 20:20 UTC (permalink / raw) To: Neil Brown; +Cc: Dan Williams, Roger Lucas, linux-raid Neil Brown wrote: > On Friday December 8, neilb@suse.de wrote: > >> I have measured very slow write throughput for raid5 as well, though >> 2.6.18 does seem to have the same problem. I'll double check and do a >> git bisect and see what I can come up with. >> > > Correction... it isn't 2.6.18 that fixes the problem. It is compiling > without LOCKDEP or PROVE_LOCKING. I remove those and suddenly a > 3 drive raid5 is faster than a single drive rather than much slower. > > Bill: Do you have LOCKDEP or PROVE_LOCKING enabled in your .config ?? > I have to check tomorrow, I'm using the Fedora kernel (as noted in the first post on this) rather than one I built, just so others could verify my results as several have been kind enough to do. Have to run, but I will check tomorrow or Monday morning early at the latest. -- bill davidsen <davidsen@tmr.com> CTO TMR Associates, Inc Doing interesting things with small computers since 1979 ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Odd (slow) RAID performance 2006-12-08 7:28 ` Neil Brown 2006-12-09 20:20 ` Bill Davidsen @ 2006-12-12 17:44 ` Bill Davidsen 2006-12-12 18:48 ` Raz Ben-Jehuda(caro) 1 sibling, 1 reply; 20+ messages in thread
From: Bill Davidsen @ 2006-12-12 17:44 UTC (permalink / raw)
To: Neil Brown; +Cc: Dan Williams, Roger Lucas, linux-raid

Neil Brown wrote:
> On Friday December 8, neilb@suse.de wrote:
>
>> I have measured very slow write throughput for raid5 as well, though
>> 2.6.18 does seem to have the same problem. I'll double check and do a
>> git bisect and see what I can come up with.
>>
>
> Correction... it isn't 2.6.18 that fixes the problem. It is compiling
> without LOCKDEP or PROVE_LOCKING. I remove those and suddenly a
> 3 drive raid5 is faster than a single drive rather than much slower.
>
> Bill: Do you have LOCKDEP or PROVE_LOCKING enabled in your .config ??

YES and NO respectively. I did try increasing the stripe_cache_size and got better results, but nowhere near max performance; perhaps LOCKDEP is still at fault, although performance of RAID-0 is as expected, so I'm dubious. In any case, by pushing the size from 256 to 1024, 4096, and finally 10240 I was able to raise the speed to 82MB/s, which is right at the edge of what I need. I want to read the doc on stripe_cache_size before going huge; if that's in KB, 10MB is a LOT of cache when 256 works perfectly for RAID-0.

I noted that the performance really was bad using 2k writes before increasing the stripe_cache; I will repeat that after doing some other "real work" things.

Any additional input appreciated. I would expect the speed to be (Ndisk - 1)*SingleDiskSpeed without a huge buffer, so the fact that it isn't makes me suspect there's unintended serialization or buffering, even when not needed (and NOT wanted).

Thanks for the feedback, I'm updating the files as I type.
http://www.tmr.com/~davidsen/RAID_speed
http://www.tmr.com/~davidsen/FC6-config

-- bill davidsen <davidsen@tmr.com> CTO TMR Associates, Inc Doing interesting things with small computers since 1979

^ permalink raw reply [flat|nested] 20+ messages in thread
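On the units question: stripe_cache_size is a count of stripe-cache entries rather than kilobytes, and each entry holds roughly one page (4k here) per member device, so the memory cost for this 3-disk array works out to something like the figures below (a back-of-the-envelope estimate, ignoring per-entry overhead):

  # stripe_cache_size x page size x member disks ~= stripe cache memory
  #     256  x 4k x 3  ~=   3 MB   (the default)
  #   10240  x 4k x 3  ~= 120 MB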
* Re: Odd (slow) RAID performance 2006-12-12 17:44 ` Bill Davidsen @ 2006-12-12 18:48 ` Raz Ben-Jehuda(caro) 2006-12-12 21:51 ` Bill Davidsen 0 siblings, 1 reply; 20+ messages in thread
From: Raz Ben-Jehuda(caro) @ 2006-12-12 18:48 UTC (permalink / raw)
To: Bill Davidsen; +Cc: Roger Lucas, linux-raid

On 12/12/06, Bill Davidsen <davidsen@tmr.com> wrote:
> Neil Brown wrote:
> > On Friday December 8, neilb@suse.de wrote:
> >
> >> I have measured very slow write throughput for raid5 as well, though
> >> 2.6.18 does seem to have the same problem. I'll double check and do a
> >> git bisect and see what I can come up with.
> >>
> >
> > Correction... it isn't 2.6.18 that fixes the problem. It is compiling
> > without LOCKDEP or PROVE_LOCKING. I remove those and suddenly a
> > 3 drive raid5 is faster than a single drive rather than much slower.
> >
> > Bill: Do you have LOCKDEP or PROVE_LOCKING enabled in your .config ??
>
> YES and NO respectively. I did try increasing the stripe_cache_size and
> got better but not anywhere near max performance, perhaps the
> PROVE_LOCKING is still at fault, although performance of RAID-0 is as
> expected, so I'm dubious. In any case, by pushing the size from 256 to
> 1024, 4096, and finally 10240 I was able to raise the speed to 82MB/s,
> which is right at the edge of what I need. I want to read the doc on
> stripe_cache_size before going huge, if that's K 10MB is a LOT of cache
> when 256 works perfectly in RAID-0.
>
> I noted that the performance really was bad using 2k write, before
> increasing the stripe_cache, I will repeat that after doing some other
> "real work" things.
>
> Any additional input appreciated, I would expect the speed to be (Ndisk
> - 1)*SingleDiskSpeed without a huge buffer, so the fact that it isn't
> makes me suspect there's unintended serialization or buffering, even
> when not need (and NOT wanted).
>
> Thanks for the feedback, I'm updating the files as I type.
> http://www.tmr.com/~davidsen/RAID_speed
> http://www.tmr.com/~davidsen/FC6-config
>
> --
> bill davidsen <davidsen@tmr.com>
> CTO TMR Associates, Inc
> Doing interesting things with small computers since 1979
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

Bill, hello.
I have been working on raid5 write throughput. The whole idea is the access pattern: one should use buffers sized with respect to the stripe size; this way you will be able to eliminate the undesired reads. By accessing it correctly I have managed to reach a write throughput that scales with the number of disks in the raid.

-- Raz

^ permalink raw reply [flat|nested] 20+ messages in thread
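Raz's suggestion can be tried directly: with three disks and 256k chunks, one full stripe carries 512k of data (two data chunks plus parity), so issuing writes in whole-stripe multiples -- and bypassing the page cache with oflag=direct where the local dd supports it -- should let md compute parity without reading old data back. A hedged sketch, destructive, with /dev/md0 as a placeholder:

  # 512k = (3 disks - 1 parity) x 256k chunk = one full stripe of data
  sync
  time bash -c "dd if=/dev/zero of=/dev/md0 bs=512k count=4096 oflag=direct; sync"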
* Re: Odd (slow) RAID performance 2006-12-12 18:48 ` Raz Ben-Jehuda(caro) @ 2006-12-12 21:51 ` Bill Davidsen 2006-12-13 17:44 ` Mark Hahn 0 siblings, 1 reply; 20+ messages in thread From: Bill Davidsen @ 2006-12-12 21:51 UTC (permalink / raw) To: Raz Ben-Jehuda(caro); +Cc: Roger Lucas, linux-raid Raz Ben-Jehuda(caro) wrote: > On 12/12/06, Bill Davidsen <davidsen@tmr.com> wrote: >> Neil Brown wrote: >> > On Friday December 8, neilb@suse.de wrote: >> > >> >> I have measured very slow write throughput for raid5 as well, though >> >> 2.6.18 does seem to have the same problem. I'll double check and >> do a >> >> git bisect and see what I can come up with. >> >> >> > >> > Correction... it isn't 2.6.18 that fixes the problem. It is compiling >> > without LOCKDEP or PROVE_LOCKING. I remove those and suddenly a >> > 3 drive raid5 is faster than a single drive rather than much slower. >> > >> > Bill: Do you have LOCKDEP or PROVE_LOCKING enabled in your .config ?? >> >> YES and NO respectively. I did try increasing the stripe_cache_size and >> got better but not anywhere near max performance, perhaps the >> PROVE_LOCKING is still at fault, although performance of RAID-0 is as >> expected, so I'm dubious. In any case, by pushing the size from 256 to >> 1024, 4096, and finally 10240 I was able to raise the speed to 82MB/s, >> which is right at the edge of what I need. I want to read the doc on >> stripe_cache_size before going huge, if that's K 10MB is a LOT of cache >> when 256 works perfectly in RAID-0. >> >> I noted that the performance really was bad using 2k write, before >> increasing the stripe_cache, I will repeat that after doing some other >> "real work" things. >> >> Any additional input appreciated, I would expect the speed to be (Ndisk >> - 1)*SingleDiskSpeed without a huge buffer, so the fact that it isn't >> makes me suspect there's unintended serialization or buffering, even >> when not need (and NOT wanted). >> >> Thanks for the feedback, I'm updating the files as I type. >> http://www.tmr.com/~davidsen/RAID_speed >> http://www.tmr.com/~davidsen/FC6-config >> >> -- >> bill davidsen <davidsen@tmr.com> >> CTO TMR Associates, Inc >> Doing interesting things with small computers since 1979 > > Bill helllo > I have been working on raid5 performance write throughout. > The whole idea is the access pattern. > One should buffers with respect to the size of stripe. > this way you will be able to eiliminate the undesired reads. > By accessing it correctly I have managed reach a write > throughout with respect to the number of disks in the raid. > > I'm doing the tests writing 2GB of data to the raw array, in 1MB writes. The array is RAID-5 with 256 chunk size. I wouldn't really expect any reads, unless I totally misunderstand how all those numbers work together. I was really trying to avoid any issues there.However, the only other size I have tried was 2K blocks, so I can try other sizes. I have a hard time picturing why smaller sizes would be better, but that's what testing is for. -- bill davidsen <davidsen@tmr.com> CTO TMR Associates, Inc Doing interesting things with small computers since 1979 ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Odd (slow) RAID performance 2006-12-12 21:51 ` Bill Davidsen @ 2006-12-13 17:44 ` Mark Hahn 2006-12-20 4:05 ` Bill Davidsen 0 siblings, 1 reply; 20+ messages in thread From: Mark Hahn @ 2006-12-13 17:44 UTC (permalink / raw) To: Bill Davidsen; +Cc: linux-raid >>> which is right at the edge of what I need. I want to read the doc on >>> stripe_cache_size before going huge, if that's K 10MB is a LOT of cache >>> when 256 works perfectly in RAID-0. but they are basically unrelated. in r5/6, the stripe cache is absolutely critical in caching parity chunks. in r0, never functions this way, though it may help some workloads a bit (IOs which aren't naturally aligned to the underlying disk layout.) >>> Any additional input appreciated, I would expect the speed to be (Ndisk >>> - 1)*SingleDiskSpeed without a huge buffer, so the fact that it isn't as others have reported, you can actually approach that with "naturally" aligned and sized writes. > I'm doing the tests writing 2GB of data to the raw array, in 1MB writes. The > array is RAID-5 with 256 chunk size. I wouldn't really expect any reads, but how many disks? if your 1M writes are to 4 data disks, you stand a chance of streaming (assuming your writes are naturally aligned, or else you'll be somewhat dependent on the stripe cache.) in other words, your whole-stripe size is ndisks*chunksize, and for 256K chunks and, say, 14 disks, that's pretty monstrous... I think that's a factor often overlooked - large chunk sizes, especially with r5/6 AND lots of disks, mean you probably won't ever do "blind" updates, and thus need the r/m/w cycle. in that case, if the stripe cache is not big/smart enough, you'll be limited by reads. I'd like to experiment with this, to see how much benefit you really get from using larger chunk sizes. I'm guessing that past 32K or so, normal *ata systems don't speedup much. fabrics with higher latency or command/arbitration overhead would want larger chunks. > tried was 2K blocks, so I can try other sizes. I have a hard time picturing > why smaller sizes would be better, but that's what testing is for. larger writes (from user-space) generally help, probably up to MB's. smaller chunks help by making it more likley to do blind parity updates; a larger stripe cache can help that too. I think I recall an earlier thread regarding how the stripe cache is used somewhat naively - that all IO goes through it. the most important blocks would be parity and "ends" of a write that partially update an underlying chunk. (conversely, don't bother caching anything which can be blindly written to disk.) regards, mark hahn. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Odd (slow) RAID performance 2006-12-13 17:44 ` Mark Hahn @ 2006-12-20 4:05 ` Bill Davidsen 0 siblings, 0 replies; 20+ messages in thread From: Bill Davidsen @ 2006-12-20 4:05 UTC (permalink / raw) To: Mark Hahn; +Cc: linux-raid Mark Hahn wrote: >>>> which is right at the edge of what I need. I want to read the doc on >>>> stripe_cache_size before going huge, if that's K 10MB is a LOT of >>>> cache >>>> when 256 works perfectly in RAID-0. > > but they are basically unrelated. in r5/6, the stripe cache is > absolutely > critical in caching parity chunks. in r0, never functions this way, > though > it may help some workloads a bit (IOs which aren't naturally aligned > to the underlying disk layout.) > >>>> Any additional input appreciated, I would expect the speed to be >>>> (Ndisk >>>> - 1)*SingleDiskSpeed without a huge buffer, so the fact that it isn't > > as others have reported, you can actually approach that with "naturally" > aligned and sized writes. I don't know what would be natural, I have three drives, 256 chunk size and was originally testing with 1MB writes. I have a hard time seeing a case where there would be a need to read-alter-rewrite, each chunk should be writable as data1, data2, and parity, without readback. I was writing directly to the array, so the data should start on a chunk boundary. Until I went very large on stripe-cache-size performance was almost exactly 100% the write speed of a single drive. There is no obvious way to explain that other than writing one drive at a time. And shrinking write size by factors of two resulted in decreasing performance down to about 13% of the speed of a single drive. Such performance just isn't useful, and going to RAID-10 eliminated the problem, indicating that the RAID-5 implementation is the cause. > >> I'm doing the tests writing 2GB of data to the raw array, in 1MB >> writes. The array is RAID-5 with 256 chunk size. I wouldn't really >> expect any reads, > > but how many disks? if your 1M writes are to 4 data disks, you stand > a chance of streaming (assuming your writes are naturally aligned, or > else you'll be somewhat dependent on the stripe cache.) > in other words, your whole-stripe size is ndisks*chunksize, and for > 256K chunks and, say, 14 disks, that's pretty monstrous... Three drives, so they could be totally isolated from other i/o. > > I think that's a factor often overlooked - large chunk sizes, especially > with r5/6 AND lots of disks, mean you probably won't ever do "blind" > updates, and thus need the r/m/w cycle. in that case, if the stripe > cache > is not big/smart enough, you'll be limited by reads. I didn't have lots of disks, and when the data and parity are all being updated in full chunk increments, there's no reason for a read, since the data won't be needed. I agree that it's probably being read, but needlessly. > > I'd like to experiment with this, to see how much benefit you really > get from using larger chunk sizes. I'm guessing that past 32K > or so, normal *ata systems don't speedup much. fabrics with higher > latency or command/arbitration overhead would want larger chunks. > >> tried was 2K blocks, so I can try other sizes. I have a hard time >> picturing why smaller sizes would be better, but that's what testing >> is for. > > larger writes (from user-space) generally help, probably up to MB's. > smaller chunks help by making it more likley to do blind parity updates; > a larger stripe cache can help that too. 
I tried 256B to 1MB sizes, 1MB was best, or more correctly least unacceptable. > > I think I recall an earlier thread regarding how the stripe cache is used > somewhat naively - that all IO goes through it. the most important > blocks would be parity and "ends" of a write that partially update an > underlying chunk. (conversely, don't bother caching anything which > can be blindly written to disk.) I fear that last parenthetical isn't being observed. If it weren't for RAID-1 and RAID-10 being fast I wouldn't complain about RAID-5. -- bill davidsen <davidsen@tmr.com> CTO TMR Associates, Inc Doing interesting things with small computers since 1979 ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Odd (slow) RAID performance 2006-12-08 6:01 ` Neil Brown 2006-12-08 7:28 ` Neil Brown @ 2006-12-09 20:16 ` Bill Davidsen 1 sibling, 0 replies; 20+ messages in thread From: Bill Davidsen @ 2006-12-09 20:16 UTC (permalink / raw) To: Neil Brown; +Cc: Dan Williams, Roger Lucas, linux-raid Neil Brown wrote: > On Monday December 4, dan.j.williams@gmail.com wrote: > >> Here is where I step into supposition territory. Perhaps the >> discrepancy is related to the size of the requests going to the block >> layer. raid5 always makes page sized requests with the expectation >> that they will coalesce into larger requests in the block layer. >> Maybe we are missing coalescing opportunities in raid5 compared to >> what happens in the raid0 case? Are there any io scheduler knobs to >> turn along these lines? >> > > This can be measured. /proc/diskstats reports the number of requests > as well as the number of sectors. > The number of write requests is column 8. The number of write sectors > is column 10. Comparing these you can get an average request size. > > I have found that the average request size is proportional to the size > of the stripe cache (roughly, with limits) but increasing it doesn't > increase through put. > I have measured very slow write throughput for raid5 as well, though > 2.6.18 does seem to have the same problem. I'll double check and do a > git bisect and see what I can come up with. > > NeilBrown Agreed, this is an ongoing problem, not a regression in 2.6.19. But I am writing 50MB/s to a single drive, 3x that to a three way RAID-0 array of those drives, and only 35MB/s to a three drive RAID-5 array. With large writes I know no reread is needed, and yet I get consistently slow write, which gets worse with smaller data writes (2k vs. 1MB for the original test). Read performance is good, I will measure tomorrow and quantify "good," today is shot from ten minutes from now until ~2am, as I have a party to attend, followed by a 'cast to watch. -- bill davidsen <davidsen@tmr.com> CTO TMR Associates, Inc Doing interesting things with small computers since 1979 ^ permalink raw reply [flat|nested] 20+ messages in thread