* RAID-5 streaming read performance
From: Dan Christensen @ 2005-07-11 15:11 UTC (permalink / raw)
To: linux-raid
I was wondering what I should expect in terms of streaming read
performance when using (software) RAID-5 with four SATA drives. I
thought I would get a noticeable improvement compared to reads from a
single device, but that's not the case. I tested this by using dd to
read 300MB directly from disk partitions /dev/sda7, etc, and also using
dd to read 300MB directly from the raid device (/dev/md2 in this case).
I get around 57MB/s from each of the disk partitions that make up the
raid device, and about 58MB/s from the raid device. On the other
hand, if I run parallel reads from the component partitions, I get
25 to 30MB/s each, so the bus can clearly achieve more than 100MB/s.
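
A rough sketch of the kind of parallel read described here (the exact commands
Dan used are not shown in the thread; the partition names are the ones listed
under "System" below):

# Read 300MB from each component partition at the same time and report
# the per-device rates.  Illustrative only - not the original test.
for f in sda7 sdb5 sdc5 sdd5 ; do
    dd if=/dev/$f of=/dev/null bs=1M count=300 2>&1 | grep bytes/sec &
done
wait
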
Before each read, I try to clear the kernel's cache by reading
900MB from an unrelated partition on the disk. (Is this guaranteed
to work? Is there a better and/or faster way to clear cache?)
I have AAM quiet mode/low performance enabled on the drives, but (a)
this shouldn't matter too much for streaming reads, and (b) it's the
relative performance of the reads from the partitions and the RAID
device that I'm curious about.
I also get poor write performance, but that's harder to isolate
because I have to go through the lvm and filesystem layers too.
I also get poor performance from my RAID-1 array and my other
RAID-5 arrays.
Details of my tests and set-up below.
Thanks for any suggestions,
Dan
System:
- Athlon 2500+
- kernel 2.6.12.2 (also tried 2.6.11.11)
- four SATA drives (3 160G, 1 200G); Samsung Spinpoint
- SiI3114 controller (latency_timer=32 by default; tried 128 too)
- 1G ram
- blockdev --getra /dev/sda --> 256 (didn't play with these; see the sketch after this list)
- blockdev --getra /dev/md2 --> 768 (didn't play with this)
- tried anticipatory, deadline and cfq schedulers, with no significant
difference.
- machine essentially idle during tests
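
A small sketch of how the readahead and latency_timer values above can be
inspected and changed.  blockdev is used elsewhere in this thread; lspci and
setpci (pciutils) are not mentioned in it, and the PCI address is a placeholder:

# Illustrative only - find the real controller address with lspci first.
blockdev --getra /dev/sda /dev/md2      # readahead, in 512-byte sectors
blockdev --setra 1024 /dev/md2          # try a larger readahead on the array
lspci | grep -i 3114                    # locate the SiI3114 controller
setpci -s 00:0f.0 latency_timer=80      # 0x80 = 128; the default here was 32
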
Here is part of /proc/mdstat (the full output is below):
md2 : active raid5 sdd5[3] sdc5[2] sdb5[1] sda7[0]
218612160 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
Here's the test script and output:
# Clear cache:
dd if=/dev/sda8 of=/dev/null bs=1M count=900 > /dev/null 2>&1
for f in sda7 sdb5 sdc5 sdd5 ; do
echo $f
dd if=/dev/$f of=/dev/null bs=1M count=300 2>&1 | grep bytes/sec
echo
done
# Clear cache:
dd if=/dev/sda8 of=/dev/null bs=1M count=900 > /dev/null 2>&1
for f in md2 ; do
echo $f
dd if=/dev/$f of=/dev/null bs=1M count=300 2>&1 | grep bytes/sec
echo
done
Output:
sda7
314572800 bytes transferred in 5.401071 seconds (58242671 bytes/sec)
sdb5
314572800 bytes transferred in 5.621170 seconds (55962158 bytes/sec)
sdc5
314572800 bytes transferred in 5.635491 seconds (55819947 bytes/sec)
sdd5
314572800 bytes transferred in 5.333374 seconds (58981951 bytes/sec)
md2
314572800 bytes transferred in 5.386627 seconds (58398846 bytes/sec)
# cat /proc/mdstat
md1 : active raid5 sdd1[2] sdc1[1] sda2[0]
578048 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
md4 : active raid5 sdd2[3] sdc2[2] sdb2[1] sda6[0]
30748032 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
md2 : active raid5 sdd5[3] sdc5[2] sdb5[1] sda7[0]
218612160 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
md3 : active raid5 sdd6[3] sdc6[2] sdb6[1] sda8[0]
218636160 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
md0 : active raid1 sdb1[0] sda5[1]
289024 blocks [2/2] [UU]
# mdadm --detail /dev/md2
/dev/md2:
Version : 00.90.01
Creation Time : Mon Jul 4 23:54:34 2005
Raid Level : raid5
Array Size : 218612160 (208.48 GiB 223.86 GB)
Device Size : 72870720 (69.49 GiB 74.62 GB)
Raid Devices : 4
Total Devices : 4
Preferred Minor : 2
Persistence : Superblock is persistent
Update Time : Thu Jul 7 21:52:50 2005
State : clean
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 64K
UUID : c4056d19:7b4bb550:44925b88:91d5bc8a
Events : 0.10873823
Number Major Minor RaidDevice State
0 8 7 0 active sync /dev/sda7
1 8 21 1 active sync /dev/sdb5
2 8 37 2 active sync /dev/sdc5
3 8 53 3 active sync /dev/sdd5

* Re: RAID-5 streaming read performance
From: Ming Zhang @ 2005-07-13  2:08 UTC (permalink / raw)
To: Dan Christensen; +Cc: Linux RAID

On Mon, 2005-07-11 at 11:11 -0400, Dan Christensen wrote:
> I was wondering what I should expect in terms of streaming read
> performance when using (software) RAID-5 with four SATA drives.  I
> thought I would get a noticeable improvement compared to reads from a
> single device, but that's not the case.
>
> [...]
>
> System:
> - Athlon 2500+
> - kernel 2.6.12.2 (also tried 2.6.11.11)
> - four SATA drives (3 160G, 1 200G); Samsung Spinpoint
> - SiI3114 controller (latency_timer=32 by default; tried 128 too)

only 1 card, 4 ports? try some other brand of card, and try using
several cards at the same time. i have met some poor cards before.

> - 1G ram
> - blockdev --getra /dev/sda --> 256 (didn't play with these)
> - blockdev --getra /dev/md2 --> 768 (didn't play with this)
> - tried anticipatory, deadline and cfq schedulers, with no significant
>   difference.
> - machine essentially idle during tests
>
> [...]

* Re: RAID-5 streaming read performance
From: Dan Christensen @ 2005-07-13  2:52 UTC (permalink / raw)
To: mingz; +Cc: Linux RAID

Ming Zhang <mingz@ele.uri.edu> writes:

> On Mon, 2005-07-11 at 11:11 -0400, Dan Christensen wrote:
>> [...]
>> - SiI3114 controller (latency_timer=32 by default; tried 128 too)
>
> only 1 card, 4 ports? try some other brand of card, and try using
> several cards at the same time. i have met some poor cards before.

Yes, one 4-port controller.  It's on the motherboard.

I thought that since I get good throughput doing parallel reads from
the four drives (see above) that would eliminate the controller as the
bottleneck.  Am I wrong?

Dan

* Re: RAID-5 streaming read performance
From: berk walker @ 2005-07-13  3:15 UTC (permalink / raw)
To: Dan Christensen; +Cc: mingz, Linux RAID

Dan Christensen wrote:

>Ming Zhang <mingz@ele.uri.edu> writes:
>
>>only 1 card, 4 ports? try some other brand of card, and try using
>>several cards at the same time. i have met some poor cards before.
>
>Yes, one 4-port controller.  It's on the motherboard.
>
>I thought that since I get good throughput doing parallel reads from
>the four drives (see above) that would eliminate the controller as the
>bottleneck.  Am I wrong?
>
>Dan

Slavery was abolished in the 1800's.

b-

* Re: RAID-5 streaming read performance
From: Ming Zhang @ 2005-07-13 12:24 UTC (permalink / raw)
To: Dan Christensen; +Cc: Linux RAID

On Tue, 2005-07-12 at 22:52 -0400, Dan Christensen wrote:
> Yes, one 4-port controller.  It's on the motherboard.
>
> I thought that since I get good throughput doing parallel reads from
> the four drives (see above) that would eliminate the controller as the
> bottleneck.  Am I wrong?

have u tried parallel writes?

* Re: RAID-5 streaming read performance
From: Dan Christensen @ 2005-07-13 12:48 UTC (permalink / raw)
To: mingz; +Cc: Linux RAID

Ming Zhang <mingz@ele.uri.edu> writes:

> have u tried parallel writes?

I haven't tested it as thoroughly, as it brings lvm and the filesystem
into the mix.  (The disks are in "production" use, and are fairly
full, so I can't do writes directly to the disk partitions/raid
device.)

My preliminary finding is that raid writes are faster than non-raid
writes: 49MB/s vs 39MB/s.  Still not stellar performance, though.

Question for the list: if I'm doing a long sequential write, naively
each parity block will get recalculated and rewritten several times,
once for each non-parity block in the stripe.  Does the write-caching
that the kernel does mean that each parity block will only get written
once?

Dan

* Re: RAID-5 streaming read performance
From: Ming Zhang @ 2005-07-13 12:52 UTC (permalink / raw)
To: Dan Christensen; +Cc: Linux RAID

On Wed, 2005-07-13 at 08:48 -0400, Dan Christensen wrote:
> I haven't tested it as thoroughly, as it brings lvm and the filesystem
> into the mix.  (The disks are in "production" use, and are fairly
> full, so I can't do writes directly to the disk partitions/raid
> device.)

testing on a production environment is too dangerous. :P and many
benchmark tools u can not run there either.

LVM overhead is small, but file system overhead is hard to say.

> Question for the list: if I'm doing a long sequential write, naively
> each parity block will get recalculated and rewritten several times,
> once for each non-parity block in the stripe.  Does the write-caching
> that the kernel does mean that each parity block will only get written
> once?

if you write sequentially, you might see full-stripe writes and thus
each parity block written only once. but if you write through a file
system, and the file system does metadata writes and log writes, then
things become more complicated.

you can use iostat to see the reads and writes hitting your disks.
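
A minimal sketch of the iostat check suggested above (sysstat's iostat; the
1-second interval and options are illustrative, not taken from the thread):

# Collect extended per-device stats while a streaming read runs.
iostat -x 1 > iostat.log &
IOSTAT=$!
dd if=/dev/md2 of=/dev/null bs=1M count=300
kill $IOSTAT
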

* Re: RAID-5 streaming read performance
From: Dan Christensen @ 2005-07-13 14:23 UTC (permalink / raw)
To: mingz; +Cc: Linux RAID

Ming Zhang <mingz@ele.uri.edu> writes:

> testing on a production environment is too dangerous. :P and many
> benchmark tools u can not run there either.

Well, I put "production" in quotes because this is just a home mythtv
box.  :-)  So there are plenty of times when it is idle and I can do
benchmarks.  But I can't erase the hard drives in my tests.

> LVM overhead is small, but file system overhead is hard to say.

I expected LVM overhead to be small, but in my tests it is very high.
I plan to discuss this on the lvm mailing list after I've got the RAID
working as well as possible, but as an example:

Streaming reads using dd to /dev/null:

component partitions, e.g. /dev/sda7: 58MB/s
raid device /dev/md2: 59MB/s
lvm device /dev/main/media: 34MB/s

So something is seriously wrong with my lvm set-up (or with lvm).  The
lvm device is linearly mapped to the initial blocks of md2, so the
last two tests should be reading the same blocks from disk.

> if you write sequentially, you might see full-stripe writes and thus
> each parity block written only once.

Glad to hear it.  In that case, sequential writes to a RAID-5 device
with 4 physical drives should be up to 3 times faster than writes to a
single device (ignoring journaling, time for calculating parity, bus
bandwidth issues, etc).

Is this "stripe write" something that the md layer does to optimize
things?  In other words, does the md layer cache writes and write a
stripe at a time when that's possible?  Or is this just an automatic
effect of the general purpose write-caching that the kernel does?

> but if you write through a file system, and the file system does
> metadata writes and log writes, then things become more complicated.

Yes.  For now I'm starting at the bottom and working up...

> you can use iostat to see the reads and writes hitting your disks.

Thanks, I'll try that.

Dan

* Re: RAID-5 streaming read performance
From: Ming Zhang @ 2005-07-13 14:29 UTC (permalink / raw)
To: Dan Christensen; +Cc: Linux RAID

On Wed, 2005-07-13 at 10:23 -0400, Dan Christensen wrote:
> Streaming reads using dd to /dev/null:
>
> component partitions, e.g. /dev/sda7: 58MB/s
> raid device /dev/md2: 59MB/s
> lvm device /dev/main/media: 34MB/s
>
> So something is seriously wrong with my lvm set-up (or with lvm).  The
> lvm device is linearly mapped to the initial blocks of md2, so the
> last two tests should be reading the same blocks from disk.

this is interesting.

> Glad to hear it.  In that case, sequential writes to a RAID-5 device
> with 4 physical drives should be up to 3 times faster than writes to a
> single device (ignoring journaling, time for calculating parity, bus
> bandwidth issues, etc).

sounds reasonable, but hard to see in practice i feel.

> Is this "stripe write" something that the md layer does to optimize
> things?  In other words, does the md layer cache writes and write a
> stripe at a time when that's possible?  Or is this just an automatic
> effect of the general purpose write-caching that the kernel does?

md people will give you more details. :)

* Re: RAID-5 streaming read performance
From: Dan Christensen @ 2005-07-13 17:56 UTC (permalink / raw)
To: linux-raid

Here's a question for people running software raid-5: do you get
significantly better read speed from a raid-5 device than from its
component partitions/hard drives, using the simple dd test I did?

Knowing this will help determine whether something is funny with my
set-up and/or hardware, or if I just had unrealistic expectations about
software raid performance.

Feel free to reply directly to me if you don't want to clutter the
list.  My dumb script is below.

Thanks,

Dan

#!/bin/sh

dd if=/dev/sda8 of=/dev/null bs=1M count=900 > /dev/null 2>&1
for f in sda7 sdb5 sdc5 sdd5 ; do
    echo $f; dd if=/dev/$f of=/dev/null bs=1M count=300 2>&1 | grep bytes/sec
    echo; done

dd if=/dev/sda8 of=/dev/null bs=1M count=900 > /dev/null 2>&1
for f in md2 ; do
    echo $f; dd if=/dev/$f of=/dev/null bs=1M count=300 2>&1 | grep bytes/sec
    echo; done

* Re: RAID-5 streaming read performance
From: Neil Brown @ 2005-07-13 22:38 UTC (permalink / raw)
To: Dan Christensen; +Cc: linux-raid

On Wednesday July 13, jdc@uwo.ca wrote:
> Here's a question for people running software raid-5: do you get
> significantly better read speed from a raid-5 device than from its
> component partitions/hard drives, using the simple dd test I did?

SCSI-160 bus, using just 4 of the 15000rpm drives:

  each drive by itself delivers about 67M/s
  Three drives in parallel deliver 40M/s each, total of 120M/s
  4 give 30M/s each or a total of 120M/s

  raid5 over 4 drives delivers 132M/s

(We've just ordered a SCSI-320 card to make better use of the drives).

So with top-quality (and price) hardware, it seems to do the right
thing.

NeilBrown

* Re: RAID-5 streaming read performance
From: Ming Zhang @ 2005-07-14  0:09 UTC (permalink / raw)
To: Neil Brown; +Cc: Dan Christensen, Linux RAID

On Thu, 2005-07-14 at 08:38 +1000, Neil Brown wrote:
> SCSI-160 bus, using just 4 of the 15000rpm drives:
>
>   each drive by itself delivers about 67M/s
>   Three drives in parallel deliver 40M/s each, total of 120M/s
>   4 give 30M/s each or a total of 120M/s
>
>   raid5 over 4 drives delivers 132M/s

why 132MB/s here instead of the 120MB/s (3 * 40MB/s) u mentioned? any
factor that leads to this increase?

* Re: RAID-5 streaming read performance
From: Neil Brown @ 2005-07-14  1:16 UTC (permalink / raw)
To: mingz; +Cc: Dan Christensen, Linux RAID

On Wednesday July 13, mingz@ele.uri.edu wrote:
> why 132MB/s here instead of the 120MB/s (3 * 40MB/s) u mentioned? any
> factor that leads to this increase?

I did another test over 10 times the amount of data, and got 34M/s for
4 concurrent individual drives, which multiplies out to 136M/s.  The
same amount of data off the raid5 gives 137M/s, so I think it was just
experimental error.

NeilBrown

* Re: RAID-5 streaming read performance
From: Ming Zhang @ 2005-07-14  1:25 UTC (permalink / raw)
To: Neil Brown; +Cc: Dan Christensen, Linux RAID

On Thu, 2005-07-14 at 11:16 +1000, Neil Brown wrote:
> I did another test over 10 times the amount of data, and got 34M/s for
> 4 concurrent individual drives, which multiplies out to 136M/s.  The
> same amount of data off the raid5 gives 137M/s, so I think it was just
> experimental error.

ic. thanks for the explanation. yes, agree. it seems that u can get
near-linear performance with decent SCSI HW, while what we can get from
SATA is not as good. :P

Ming

* Re: RAID-5 streaming read performance
From: David Greaves @ 2005-07-13 18:02 UTC (permalink / raw)
To: Dan Christensen; +Cc: mingz, Linux RAID

Dan Christensen wrote:

>Well, I put "production" in quotes because this is just a home mythtv
>box.  :-)  So there are plenty of times when it is idle and I can do
>benchmarks.  But I can't erase the hard drives in my tests.

Me too.

>I expected LVM overhead to be small, but in my tests it is very high.
>I plan to discuss this on the lvm mailing list after I've got the RAID
>working as well as possible, but as an example:
>
>Streaming reads using dd to /dev/null:
>
>component partitions, e.g. /dev/sda7: 58MB/s
>raid device /dev/md2: 59MB/s
>lvm device /dev/main/media: 34MB/s

This is not my experience.
What are the readahead settings?
I found significant variation in performance by varying the readahead at
raw, md and lvm device level.

In my setup I get

component partitions, e.g. /dev/sda7: 39MB/s
raid device /dev/md2: 31MB/s
lvm device /dev/main/media: 53MB/s

(oldish system - but note that lvm device is *much* faster)

For your entertainment you may like to try this to 'tune' your readahead
- it's OK to use so long as you're not recording:

(FYI I find that setting readahead to 0 on all devices and 4096 on the
lvm device gets me the best performance - which makes sense if you think
about it...)

#!/bin/bash
RAW_DEVS="/dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/hdb"
MD_DEVS=/dev/md0
LV_DEVS=/dev/huge_vg/huge_lv

LV_RAS="0 128 256 1024 4096 8192"
MD_RAS="0 128 256 1024 4096 8192"
RAW_RAS="0 128 256 1024 4096 8192"

function show_ra()
{
    for i in $RAW_DEVS $MD_DEVS $LV_DEVS
    do echo -n "$i `blockdev --getra $i` :: "
    done
    echo
}

function set_ra()
{
    RA=$1
    shift
    for dev in $@
    do
        blockdev --setra $RA $dev
    done
}

function show_performance()
{
    COUNT=4000000
    dd if=$LV_DEVS of=/dev/null count=$COUNT 2>&1 | grep seconds
}

for RAW_RA in $RAW_RAS
do
    set_ra $RAW_RA $RAW_DEVS
    for MD_RA in $MD_RAS
    do
        set_ra $MD_RA $MD_DEVS
        for LV_RA in $LV_RAS
        do
            set_ra $LV_RA $LV_DEVS
            show_ra
            show_performance
        done
    done
done
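
If the sweep above confirms David's observation (readahead 0 on the raw and md
devices, 4096 on the lvm device), the winning combination could be applied with
something like the following sketch - device names as in his script, readahead
values in blockdev's 512-byte-sector units:

# Apply the combination David reports as fastest on his system:
# no readahead on raw and md devices, 4096 sectors (2MB) on the LV.
for dev in /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/hdb /dev/md0 ; do
    blockdev --setra 0 $dev
done
blockdev --setra 4096 /dev/huge_vg/huge_lv
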

* Re: RAID-5 streaming read performance
From: Ming Zhang @ 2005-07-13 18:14 UTC (permalink / raw)
To: David Greaves; +Cc: Dan Christensen, Linux RAID

On Wed, 2005-07-13 at 19:02 +0100, David Greaves wrote:
> In my setup I get
>
> component partitions, e.g. /dev/sda7: 39MB/s
> raid device /dev/md2: 31MB/s
> lvm device /dev/main/media: 53MB/s
>
> (oldish system - but note that lvm device is *much* faster)

this is so interesting to see! seems that some readahead parameters
have a negative impact.

> [...]

* Re: RAID-5 streaming read performance
From: David Greaves @ 2005-07-13 21:18 UTC (permalink / raw)
To: mingz; +Cc: Dan Christensen, Linux RAID

Ming Zhang wrote:

>>component partitions, e.g. /dev/sda7: 39MB/s
>>raid device /dev/md2: 31MB/s
>>lvm device /dev/main/media: 53MB/s
>>
>>(oldish system - but note that lvm device is *much* faster)
>
>this is so interesting to see! seems that some readahead parameters
>have a negative impact.

I guess each raw device does some readahead, then md0 does some
readahead, and then lvm does some readahead.  Theoretically the md0
and lvm readahead should overlap - but I guess that much of the raw
device level readahead is discarded.

David

* Re: RAID-5 streaming read performance
From: Ming Zhang @ 2005-07-13 21:44 UTC (permalink / raw)
To: David Greaves; +Cc: Dan Christensen, Linux RAID

On Wed, 2005-07-13 at 22:18 +0100, David Greaves wrote:
> I guess each raw device does some readahead, then md0 does some
> readahead, and then lvm does some readahead.  Theoretically the md0
> and lvm readahead should overlap - but I guess that much of the raw
> device level readahead is discarded.

for a streaming read, what you read ahead now will always be used
exactly once in the near future. at least i think raw device readahead
could be turned on at the same time as readahead on one of the OS
components, raid or lvm. but in your case, u get the best result when
only one is turned on.

ming

* Re: RAID-5 streaming read performance
From: David Greaves @ 2005-07-13 21:50 UTC (permalink / raw)
To: mingz; +Cc: Dan Christensen, Linux RAID

Ming Zhang wrote:

>for a streaming read, what you read ahead now will always be used
>exactly once in the near future. at least i think raw device readahead
>could be turned on at the same time as readahead on one of the OS
>components, raid or lvm. but in your case, u get the best result when
>only one is turned on.

I doubt it's just me - what results do others get with that script?

David

* Re: RAID-5 streaming read performance
From: Ming Zhang @ 2005-07-13 21:55 UTC (permalink / raw)
To: David Greaves; +Cc: Dan Christensen, Linux RAID

On Wed, 2005-07-13 at 22:50 +0100, David Greaves wrote:
> I doubt it's just me - what results do others get with that script?

my box is in use now. i might try it tomorrow to see what happens. :P

Ming

* Re: RAID-5 streaming read performance
From: Neil Brown @ 2005-07-13 22:52 UTC (permalink / raw)
To: David Greaves; +Cc: mingz, Dan Christensen, Linux RAID

On Wednesday July 13, david@dgreaves.com wrote:
> I guess each raw device does some readahead, then md0 does some
> readahead, and then lvm does some readahead.  Theoretically the md0
> and lvm readahead should overlap - but I guess that much of the raw
> device level readahead is discarded.

No.  Devices don't do readahead (well, modern drives may well
read-ahead into an on-drive buffer, but that is completely transparent
and separate from any readahead that linux does).

Each device just declares how much readahead it thinks is appropriate
for that device.  The linux mm layer does read-ahead by requesting
from devices blocks that haven't actually been asked for by upper
layers.  The amount of readahead depends on the behaviour of the app
doing the reads, and the setting declared by the devices.

raid5 declares a read-ahead size of twice the stripe size,
i.e. chunk size * (disks-1) * 2.  Possibly it should make it bigger if
the underlying devices would all be happy with that; however I haven't
given the issue a lot of thought, and it is tunable from userspace.

NeilBrown
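
As a quick sanity check, that formula matches the md2 readahead reported at the
start of the thread (64k chunks, 4 disks, readahead reported by blockdev in
512-byte sectors):

# chunk size * (disks-1) * 2 = 64KiB * 3 * 2 = 384KiB = 768 sectors,
# matching "blockdev --getra /dev/md2 --> 768" from the original post.
echo $(( (64 * 1024) * (4 - 1) * 2 / 512 ))     # prints 768
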

* Re: RAID-5 streaming read performance
From: Dan Christensen @ 2005-07-14  3:58 UTC (permalink / raw)
To: David Greaves, Linux RAID; +Cc: mingz

David Greaves <david@dgreaves.com> writes:

> In my setup I get
>
> component partitions, e.g. /dev/sda7: 39MB/s
> raid device /dev/md2: 31MB/s
> lvm device /dev/main/media: 53MB/s
>
> (oldish system - but note that lvm device is *much* faster)

Did you test component device and raid device speed using the
read-ahead settings tuned for lvm reads?  If so, that's not a fair
comparison.  :-)

> For your entertainment you may like to try this to 'tune' your readahead
> - it's OK to use so long as you're not recording:

Thanks, I played around with that a lot.  I tuned readahead to
optimize lvm device reads, and this improved things greatly.  It turns
out the default lvm settings had readahead set to 0!  But by tuning
things, I could get my read speed up to 59MB/s.  This is with raw
device readahead 256, md device readahead 1024 and lvm readahead 2048.
(The speed was most sensitive to the last one, but did seem to depend
on the other ones a bit too.)

I separately tuned the raid device read speed.  To maximize this, I
needed to set the raw device readahead to 1024 and the raid device
readahead to 4096.  This brought my raid read speed from 59MB/s to
78MB/s.  Better!  (But note that now this makes the lvm read speed
look bad.)

My raw device read speed is independent of the readahead setting, as
long as it is at least 256.  The speed is about 58MB/s.

Summary:

raw device:  58MB/s
raid device: 78MB/s
lvm device:  59MB/s

raid still isn't achieving the 106MB/s that I can get with parallel
direct reads, but at least it's getting closer.

As a simple test, I wrote a program like dd that reads and discards
64k chunks of data from a device, but which skips 1 out of every four
chunks (simulating skipping parity blocks).  It's not surprising that
this program can only read from a raw device at about 75% the rate of
dd, since the kernel readahead is probably causing the skipped blocks
to be read anyways (or maybe because the disk head has to pass over
those sections of the disk anyways).

I then ran four copies of this program in parallel, reading from the
raw devices that make up my raid partition.  And, like md, they only
achieved about 78MB/s.  This is very close to 75% of 106MB/s.  Again,
not surprising, since I need to have raw device readahead turned on
for this to be efficient at all, so 25% of the chunks that pass
through the controller are ignored.

But I still don't understand why the md layer can't do better.  If I
turn off readahead of the raw devices, and keep it for the raid
device, then parity blocks should never be requested, so they
shouldn't use any bus/controller bandwidth.  And even if each drive is
only acting at 75% efficiency, the four drives should still be able to
saturate the bus/controller.  So I can't figure out what's going on
here.

Is there a way for me to simulate readahead in userspace, i.e. can
I do lots of sequential asynchronous reads in parallel?

Also, is there a way to disable caching of reads?  Having to clear
the cache by reading 900M each time slows down testing.  I guess I
could reboot with mem=100M, but it'd be nice to disable/enable caching
on the fly.  Hmm, maybe I can just run something like memtest which
locks a bunch of ram...

Thanks for all of the help so far!

Dan
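
Dan's skip-one-chunk-in-four reader is not posted in the thread; a very crude
stand-in can be approximated with dd in a loop.  Per-invocation overhead makes
it much slower and readahead behaves differently, so treat it only as an
illustration of the access pattern:

# Read 3 chunks out of every group of 4 (64k chunks), roughly mimicking
# a RAID-5 reader skipping the parity chunk on one component.
DEV=/dev/sda7
for i in $(seq 0 1199); do
    dd if=$DEV of=/dev/null bs=64k count=3 skip=$((i * 4)) 2>/dev/null
done
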

* Re: RAID-5 streaming read performance
From: Mark Hahn @ 2005-07-14  4:13 UTC (permalink / raw)
To: Dan Christensen; +Cc: Linux RAID

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1321 bytes --]

> > component partitions, e.g. /dev/sda7: 39MB/s
> > raid device /dev/md2: 31MB/s
> > lvm device /dev/main/media: 53MB/s
> >
> > (oldish system - but note that lvm device is *much* faster)
>
> Did you test component device and raid device speed using the
> read-ahead settings tuned for lvm reads?  If so, that's not a fair
> comparison.  :-)

I did an eval with a vendor who claimed that their lvm actually
improved bandwidth because it somehow triggered better full-stripe
operations, or readahead, or something.  filtered through a marketing
person, of course ;(

> Is there a way for me to simulate readahead in userspace, i.e. can
> I do lots of sequential asynchronous reads in parallel?

there is async IO, but I don't think this is going to help you much.

> Also, is there a way to disable caching of reads?  Having to clear

yes: O_DIRECT.

I'm attaching a little program I wrote which basically just shows you
incremental bandwidth.  you can use it to show the zones on a disk
(just "iorate -r /dev/hda -l 9999" and plot the results), or to do
normal r/w bandwidth without being confused by the page-cache.  you
can even use it as a filter to measure tape backup performance.

it doesn't try to do anything with random seeks.  it doesn't do
anything multi-stream.

regards, mark hahn.

[-- Attachment #2: Type: TEXT/PLAIN, Size: 5440 bytes --]

/* iorate.c - measure rates of sequential IO, showing incremental bandwidth
   written by Mark Hahn (hahn@mcmaster.ca) 2003,2004,2005

   the main point of this code is to illustrate the danger of running
   naive bandwidth tests on files that are small relative to the
   memory/disk bandwidth ratio of your system.  that is, on any system,
   the incremental bandwidth will start out huge, since IO is purely to
   the page cache.  once you exceed that size, bandwidth will be
   dominated by the real disk performance.  but using the average of
   these two modes is a mistake, even if you use very large files.
*/
#define _LARGEFILE64_SOURCE 1
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/time.h>
#include <sys/fcntl.h>
#include <sys/stat.h>
#include <stdarg.h>
#include <string.h>
#include <sys/mman.h>

#ifdef O_LARGEFILE
#define LF O_LARGEFILE
#elif defined(_O_LARGEFILE)
#define LF _O_LARGEFILE
#else
#define LF 0
#endif

#ifndef O_DIRECT
#define O_DIRECT 040000
#endif

typedef unsigned long long u64;

u64 bytes = 0, bytesLast = 0;
double timeStart = 0, timeLast = 0;

/* default reporting interval is every 2 seconds; in 2004, an entry-level
   desktop disk will sustain around 50 MB/s, so the default bytes interval
   is 100 MB.  whichever comes first.
*/
u64 byteInterval = 100;
double timeInterval = 2;

double gtod() {
    struct timeval tv;
    gettimeofday(&tv,0);
    return tv.tv_sec + 1e-6 * tv.tv_usec;
}

void dumpstats(int force) {
    u64 db = bytes - bytesLast;
    double now = gtod();
    double dt;
    static int first = 1;

    if (timeLast == 0)
        timeStart = timeLast = now;
    dt = now - timeLast;

    if (!force && db < byteInterval && dt < timeInterval)
        return;

    if (first) {
        printf("#%7s %7s %7s %7s\n", "secs", "MB", "MB/sec", "MB/sec");
        first = 0;
    }
    printf("%7.3f %7.3f %7.3f %7.3f\n",
           now - timeStart,
           1e-6 * bytes,
           1e-6 * db / dt,
           1e-6 * bytes / (now-timeStart));
    timeLast = now;
    bytesLast = bytes;
}

void usage() {
    fprintf(stderr,"iorate [-r/w filename] [-d] [-c chunksz][-b byteivl][-t ivl][-l szlim] [-r in] [-w out]\n");
    fprintf(stderr,"-r in or -w out select which file is read or written ('-' for stdin/out)\n");
    fprintf(stderr,"-c chunksz - size of chunks written (KB);\n");
    fprintf(stderr,"-t timeinterval - collect rate each timeinterval seconds;\n");
    fprintf(stderr,"-b byteinterval - collect rate each byteinterval MB;\n");
    fprintf(stderr,"-l limit - total output size limit (MB);\n");
    fprintf(stderr,"-d use O_DIRECT\n");
    fprintf(stderr,"defaults are: '-c 8 -b 20 -t 10 -l 10'\n");
    exit(1);
}

void fatal(char *format, ...) {
    va_list ap;
    va_start(ap,format);
    vfprintf(stderr,format,ap);
    fprintf(stderr,": errno=%d (%s)\n",errno,strerror(errno));
    va_end(ap);
    dumpstats(1);
    exit(1);
}

/* allocate a buffer using mmap to ensure it's page-aligned.
   O_DIRECT *could* be more strict than that, but probably isn't */
void *myalloc(unsigned size) {
    unsigned s = (size + 4095) & ~4095U;
    void *p = mmap(0, s,
                   PROT_READ|PROT_WRITE,
                   MAP_ANONYMOUS|MAP_PRIVATE,
                   -1, 0);
    if (p == MAP_FAILED)
        return 0;
    return p;
}

int main(int argc, char *argv[]) {
    unsigned size = 8;
    char *buffer;
    u64 limit = 10;
    char *fnameI = 0;
    char *fnameO = 0;
    int fdI = 0;
    int fdO = 1;
    int doRead = 0;
    int doWrite = 0;
    int doDirect = 0;
    int letter;

    while ((letter = getopt(argc,argv,"r:w:b:c:l:t:d")) != -1) {
        switch(letter) {
        case 'r': fnameI = optarg; doRead = 1; break;
        case 'w': fnameO = optarg; doWrite = 1; break;
        case 'b': byteInterval = atoi(optarg); break;
        case 'c': size = atoi(optarg); break;
        case 'l': limit = atoi(optarg); break;
        case 't': timeInterval = atof(optarg); break;
        case 'd': doDirect = 1; break;
        default: usage();
        }
    }
    if (argc != optind)
        usage();

    byteInterval *= 1e6;
    limit *= 1e6;
    size *= 1024;

    setbuf(stdout, 0);

    fprintf(stderr,"chunk %dK, byteInterval %uM, timeInterval %f, limit %uM\n",
            size>>10,
            (unsigned)(byteInterval>>20),
            timeInterval,
            (unsigned)(limit>>20));

    if (doRead && fnameI && strcmp(fnameI,"-")) {
        fdI = open(fnameI, O_RDONLY | LF);
        if (fdI == -1) fatal("open(read) failed");
    }
    if (doWrite && fnameO && strcmp(fnameO,"-")) {
        int flags = O_RDWR | O_CREAT | LF;
        if (doDirect) flags |= O_DIRECT;
        fdO = open(fnameO, flags, 0600);
        if (fdO == -1) fatal("open(write) failed");
    }

    buffer = myalloc(size);
    memset(buffer,'m',size);

    timeStart = timeLast = gtod();
    bytes = 0;

    while (bytes < limit) {
        int c = size;
        dumpstats(0);
        if (doRead) {
            c = read(fdI,buffer,c);
            if (c == -1) fatal("read failed");
        }
        if (doWrite) {
            c = write(fdO,buffer,c);
            if (c == -1) fatal("write failed");
        }
        bytes += c;
        /* short read/write means EOF. */
        if (c < size)
            break;
    }
    dumpstats(1);
    return 0;
}
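
Going by the usage() text and defaults in the source above, the program could
be built and pointed at the raid device roughly like this; the invocation is
illustrative, and note that in this version O_DIRECT is only applied on the
write path:

# Build and run Mark's iorate against the raid device: 64KB chunks,
# 300MB limit, incremental bandwidth printed as it goes.
# (Needs read access to the device, i.e. root.)
gcc -O2 -o iorate iorate.c
./iorate -r /dev/md2 -c 64 -l 300
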

* Re: RAID-5 streaming read performance
From: Dan Christensen @ 2005-07-14 21:16 UTC (permalink / raw)
To: linux-raid

Mark Hahn <hahn@physics.mcmaster.ca> writes:

>> Is there a way for me to simulate readahead in userspace, i.e. can
>> I do lots of sequential asynchronous reads in parallel?
>
> there is async IO, but I don't think this is going to help you much.
>
>> Also, is there a way to disable caching of reads?  Having to clear
>
> yes: O_DIRECT.

That might disable caching of reads, but it also disables readahead,
so unless I manually use aio to simulate readahead, this isn't going
to solve my problem, which is having to clear the cache before each
test to get relevant results.

I'm really surprised there isn't something in /proc you can use to
clear or disable the cache.  Would be very useful for benchmarking!

Dan

* Re: RAID-5 streaming read performance
From: Ming Zhang @ 2005-07-14 21:30 UTC (permalink / raw)
To: Dan Christensen; +Cc: Linux RAID

i also want a way to clear part of the whole page cache by file id. :)

i also want a way to tell the cache distribution - how many pages for
file A and B, ....

ming

On Thu, 2005-07-14 at 17:16 -0400, Dan Christensen wrote:
> I'm really surprised there isn't something in /proc you can use to
> clear or disable the cache.  Would be very useful for benchmarking!

* Re: RAID-5 streaming read performance
From: Mark Hahn @ 2005-07-14 23:29 UTC (permalink / raw)
To: Ming Zhang; +Cc: Dan Christensen, Linux RAID

> i also want a way to clear part of the whole page cache by file id. :)

understandably, kernel developers don't give high priority to this
sort of not-useful-for-normal-work feature.

> i also want a way to tell the cache distribution - how many pages for
> file A and B, ....

you should probably try mmaping the file and using mincore.
come to think of it, mmap+madvise might be a sensible way to
flush pages corresponding to a particular file, as well.

>> I'm really surprised there isn't something in /proc you can use to
>> clear or disable the cache.  Would be very useful for benchmarking!

I assume you noticed "blockdev --flushbufs", no?  it works for me
(ie, a small, repeated streaming read of a disk device will show
pagecache speed if you don't flush between runs).

I think the problem is that it's difficult to dissociate readahead,
writebehind and normal lru-ish caching.  there was quite a flurry of
activity around 2.4.10 related to this, and it left a bad taste in
everyone's mouth.  I think the main conclusion was that too much
fanciness results in a fragile, more subtle and difficult-to-maintain
system that performs better, true, but over a narrower range of
workloads.

regards, mark hahn
sharcnet/mcmaster
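
That suggests a simpler variant of Dan's test script from earlier in the
thread, dropping the 900MB read of an unrelated partition in favour of an
explicit flush (a sketch, device names as in his script):

# Flush each block device's cached pages before the timed read.
for f in sda7 sdb5 sdc5 sdd5 md2 ; do
    blockdev --flushbufs /dev/$f
    echo $f
    dd if=/dev/$f of=/dev/null bs=1M count=300 2>&1 | grep bytes/sec
    echo
done
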

* Re: RAID-5 streaming read performance
From: Ming Zhang @ 2005-07-15  1:23 UTC (permalink / raw)
To: Mark Hahn; +Cc: Dan Christensen, Linux RAID

On Thu, 2005-07-14 at 19:29 -0400, Mark Hahn wrote:
> > i also want a way to clear part of the whole page cache by file id. :)
>
> understandably, kernel developers don't give high priority to this
> sort of not-useful-for-normal-work feature.

agree.

> you should probably try mmaping the file and using mincore.
> come to think of it, mmap+madvise might be a sensible way to
> flush pages corresponding to a particular file, as well.

i prefer a generic way. :) it will be useful for tuning the system.
maybe a program that iterates over the kernel structures could do
this.

> I assume you noticed "blockdev --flushbufs", no?  it works for me
> (ie, a small, repeated streaming read of a disk device will show
> pagecache speed if you don't flush between runs).

it will do a flush, right? but will it flush and also drop the cache?

> I think the problem is that it's difficult to dissociate readahead,
> writebehind and normal lru-ish caching.  there was quite a flurry of
> activity around 2.4.10 related to this, and it left a bad taste in
> everyone's mouth.  I think the main conclusion was that too much
> fanciness results in a fragile, more subtle and difficult-to-maintain
> system that performs better, true, but over a narrower range of
> workloads.

maybe this will happen again for 2.6.x? i think there are still many
gray areas that can be checked, and many places that can be improved.

a test i did shows that even if you have sda and sdb forming a raid0,
the page cache for sda and sdb will not be used by raid0. kind of
funny.

thx!

Ming
* Re: RAID-5 streaming read performance
2005-07-15  1:23 ` Ming Zhang
@ 2005-07-15  2:11 ` Dan Christensen
2005-07-15 12:28 ` Ming Zhang
0 siblings, 1 reply; 41+ messages in thread
From: Dan Christensen @ 2005-07-15 2:11 UTC (permalink / raw)
To: linux-raid

Ming Zhang <mingz@ele.uri.edu> writes:

> On Thu, 2005-07-14 at 19:29 -0400, Mark Hahn wrote:
>>
>> > i also want a way to clear part of the whole page cache by file id. :)
>>
>> understandably, kernel developers don't give high priority to this sort of
>> not-useful-for-normal-work feature.
> agree.

Clearing just part of the page cache sounds too complicated to be
worth it, but clearing it all seems reasonable; some kernel developers
spend time doing benchmarks too!

>> > Dan Christensen wrote:
>> >
>> > > I'm really surprised there isn't something in /proc you can use to
>> > > clear or disable the cache.  Would be very useful for benchmarking!
>>
>> I assume you noticed "blockdev --flushbufs", no?  it works for me

I had tried this and noticed that it didn't work for files on a
filesystem.  But it does seem to work for block devices.  That's
great, thanks.  I didn't realize the cache was so complicated;
it can be retained for files but not for the block device underlying
those files!

> a test i did shows that even if you have sda and sdb forming a raid0,
> the page cache for sda and sdb will not be used by raid0. kind of
> funny.

I thought I had noticed raid devices making use of cache from
underlying devices, but a test I just did agrees with your result, for
both RAID-1 and RAID-5.  Again, this seems odd.  Shouldn't the raid
layer take advantage of a block that's already in RAM?  I guess this
won't matter in practice, since you usually don't read from both a
raid device and an underlying device.

Dan

^ permalink raw reply	[flat|nested] 41+ messages in thread
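For reference, blockdev --flushbufs is essentially a thin wrapper around the
BLKFLSBUF ioctl, which writes out any dirty buffers and then invalidates the
device's cached pages. A rough C equivalent (needs root, device path from the
command line) is sketched below; it is illustrative only.

/* flushdev.c - roughly what "blockdev --flushbufs" does: flush and
 * invalidate the cached data for a block device via BLKFLSBUF. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* BLKFLSBUF */

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s /dev/xxx\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* requires CAP_SYS_ADMIN; drops the page cache for this device only */
    if (ioctl(fd, BLKFLSBUF, 0) < 0) { perror("BLKFLSBUF"); return 1; }

    close(fd);
    return 0;
}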
* Re: RAID-5 streaming read performance
2005-07-15  2:11 ` Dan Christensen
@ 2005-07-15 12:28 ` Ming Zhang
0 siblings, 0 replies; 41+ messages in thread
From: Ming Zhang @ 2005-07-15 12:28 UTC (permalink / raw)
To: Dan Christensen; +Cc: Linux RAID

On Thu, 2005-07-14 at 22:11 -0400, Dan Christensen wrote:
> Ming Zhang <mingz@ele.uri.edu> writes:
>
> > On Thu, 2005-07-14 at 19:29 -0400, Mark Hahn wrote:
> >>
> >> > i also want a way to clear part of the whole page cache by file id. :)
> >>
> >> understandably, kernel developers don't give high priority to this sort of
> >> not-useful-for-normal-work feature.
> > agree.
>
> Clearing just part of the page cache sounds too complicated to be
> worth it, but clearing it all seems reasonable; some kernel developers
> spend time doing benchmarks too!

maybe they do not care to run a program to clear it every time. :P

> >> > Dan Christensen wrote:
> >> >
> >> > > I'm really surprised there isn't something in /proc you can use to
> >> > > clear or disable the cache.  Would be very useful for benchmarking!
> >>
> >> I assume you noticed "blockdev --flushbufs", no?  it works for me
>
> I had tried this and noticed that it didn't work for files on a
> filesystem.  But it does seem to work for block devices.  That's
> great, thanks.  I didn't realize the cache was so complicated;
> it can be retained for files but not for the block device underlying
> those files!

yes, that is why the command is named blockdev. :) i guess for files we
just need to call the fsync system call? does that call work on a block
device as well?

> > a test i did shows that even if you have sda and sdb forming a raid0,
> > the page cache for sda and sdb will not be used by raid0. kind of
> > funny.
>
> I thought I had noticed raid devices making use of cache from
> underlying devices, but a test I just did agrees with your result, for
> both RAID-1 and RAID-5.  Again, this seems odd.  Shouldn't the raid
> layer take advantage of a block that's already in RAM?  I guess this
> won't matter in practice, since you usually don't read from both a
> raid device and an underlying device.

you are right, that is weird in the real world.

ming

^ permalink raw reply	[flat|nested] 41+ messages in thread
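An aside on the fsync() question: fsync() only writes dirty pages back to
disk, it does not drop clean pages from the cache. For regular files, one
option that may work — assuming a kernel and glibc recent enough to implement
posix_fadvise(), which should be true for 2.6 — is POSIX_FADV_DONTNEED. This
is a rough sketch of that approach, not something tested in the thread.

/* dropcache.c - try to drop a regular file's pages from the page
 * cache using posix_fadvise(POSIX_FADV_DONTNEED).  fsync() is called
 * first because dirty pages are not discarded. */
#define _XOPEN_SOURCE 600
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    fsync(fd);   /* write back any dirty pages first */

    int err = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    if (err)
        fprintf(stderr, "posix_fadvise: %s\n", strerror(err));

    close(fd);
    return 0;
}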
* Re: RAID-5 streaming read performance
2005-07-14  3:58 ` Dan Christensen
2005-07-14  4:13 ` Mark Hahn
@ 2005-07-14 12:30 ` Ming Zhang
2005-07-14 14:23 ` Ming Zhang
2005-07-14 17:54 ` Dan Christensen
2005-07-15  2:38 ` Dan Christensen
2 siblings, 2 replies; 41+ messages in thread
From: Ming Zhang @ 2005-07-14 12:30 UTC (permalink / raw)
To: Dan Christensen; +Cc: David Greaves, Linux RAID

On Wed, 2005-07-13 at 23:58 -0400, Dan Christensen wrote:
> David Greaves <david@dgreaves.com> writes:
>
> > In my setup I get
> >
> > component partitions, e.g. /dev/sda7:  39MB/s
> > raid device /dev/md2:                  31MB/s
> > lvm device /dev/main/media:            53MB/s
> >
> > (oldish system - but note that lvm device is *much* faster)
>
> Did you test component device and raid device speed using the
> read-ahead settings tuned for lvm reads?  If so, that's not a fair
> comparison.  :-)
>
> > For your entertainment you may like to try this to 'tune' your readahead
> > - it's OK to use so long as you're not recording:
>
> Thanks, I played around with that a lot.  I tuned readahead to
> optimize lvm device reads, and this improved things greatly.  It turns
> out the default lvm settings had readahead set to 0!  But by tuning
> things, I could get my read speed up to 59MB/s.  This is with raw
> device readahead 256, md device readahead 1024 and lvm readahead 2048.
> (The speed was most sensitive to the last one, but did seem to depend
> on the other ones a bit too.)
>
> I separately tuned the raid device read speed.  To maximize this, I
> needed to set the raw device readahead to 1024 and the raid device
> readahead to 4096.  This brought my raid read speed from 59MB/s to
> 78MB/s.  Better!  (But note that now this makes the lvm read speed
> look bad.)
>
> My raw device read speed is independent of the readahead setting,
> as long as it is at least 256.  The speed is about 58MB/s.
>
> Summary:
>
> raw device:   58MB/s
> raid device:  78MB/s
> lvm device:   59MB/s
>
> raid still isn't achieving the 106MB/s that I can get with parallel
> direct reads, but at least it's getting closer.
>
> As a simple test, I wrote a program like dd that reads and discards
> 64k chunks of data from a device, but which skips 1 out of every four
> chunks (simulating skipping parity blocks).  It's not surprising that
> this program can only read from a raw device at about 75% the rate of
> dd, since the kernel readahead is probably causing the skipped blocks
> to be read anyways (or maybe because the disk head has to pass over
> those sections of the disk anyways).
>
> I then ran four copies of this program in parallel, reading from the
> raw devices that make up my raid partition.  And, like md, they only
> achieved about 78MB/s.  This is very close to 75% of 106MB/s.  Again,
> not surprising, since I need to have raw device readahead turned on
> for this to be efficient at all, so 25% of the chunks that pass
> through the controller are ignored.
>
> But I still don't understand why the md layer can't do better.  If I
> turn off readahead of the raw devices, and keep it for the raid
> device, then parity blocks should never be requested, so they
> shouldn't use any bus/controller bandwidth.  And even if each drive is
> only acting at 75% efficiency, the four drives should still be able to
> saturate the bus/controller.  So I can't figure out what's going on
> here.

when reading, i do not think MD will read the parity at all. but since
the parity is spread over all the disks, there might be a seek here. so
you might want to try a raid4 to see what happens as well.

> Is there a way for me to simulate readahead in userspace, i.e. can
> I do lots of sequential asynchronous reads in parallel?
>
> Also, is there a way to disable caching of reads?  Having to clear
> the cache by reading 900M each time slows down testing.  I guess
> I could reboot with mem=100M, but it'd be nice to disable/enable
> caching on the fly.  Hmm, maybe I can just run something like
> memtest which locks a bunch of ram...

after you run your code, check /proc/meminfo; the cached value might be
much lower than u expected. my feeling is that the linux page cache will
discard all of its cache once the last file handle is closed.

> Thanks for all of the help so far!
>
> Dan

^ permalink raw reply	[flat|nested] 41+ messages in thread
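The chunk-skipping reader itself is not posted in the thread; a rough
reconstruction of the idea described above — read a device in 64k chunks,
throw the data away, and seek past every fourth chunk to mimic skipping
parity — might look like the following. The chunk size and skip pattern come
from the description; everything else is guesswork.

/* skipread.c - read a device in 64k chunks, discarding the data and
 * skipping every fourth chunk (as raid5 skips parity on reads). */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

#define CHUNK (64 * 1024)

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s device chunks\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    long total = atol(argv[2]);
    char *buf = malloc(CHUNK);
    if (!buf) { perror("malloc"); return 1; }

    for (long i = 0; i < total; i++) {
        if (i % 4 == 3) {
            /* pretend this chunk is parity: seek past it instead of reading */
            if (lseek(fd, CHUNK, SEEK_CUR) < 0) break;
            continue;
        }
        ssize_t n = read(fd, buf, CHUNK);
        if (n <= 0)
            break;      /* EOF or error */
    }

    free(buf);
    close(fd);
    return 0;
}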
* Re: RAID-5 streaming read performance
2005-07-14 12:30 ` Ming Zhang
@ 2005-07-14 14:23 ` Ming Zhang
2005-07-14 17:54 ` Dan Christensen
1 sibling, 0 replies; 41+ messages in thread
From: Ming Zhang @ 2005-07-14 14:23 UTC (permalink / raw)
To: Dan Christensen; +Cc: David Greaves, Linux RAID

my mistake here. this only applies to sdX, not mdX. pls ignore this.

ming

On Thu, 2005-07-14 at 08:30 -0400, Ming Zhang wrote:
> > Also, is there a way to disable caching of reads?  Having to clear
> > the cache by reading 900M each time slows down testing.  I guess
> > I could reboot with mem=100M, but it'd be nice to disable/enable
> > caching on the fly.  Hmm, maybe I can just run something like
> > memtest which locks a bunch of ram...
>
> after you run your code, check /proc/meminfo; the cached value might be
> much lower than u expected. my feeling is that the linux page cache will
> discard all of its cache once the last file handle is closed.
>

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: RAID-5 streaming read performance
2005-07-14 12:30 ` Ming Zhang
2005-07-14 14:23 ` Ming Zhang
@ 2005-07-14 17:54 ` Dan Christensen
2005-07-14 18:00 ` Ming Zhang
1 sibling, 1 reply; 41+ messages in thread
From: Dan Christensen @ 2005-07-14 17:54 UTC (permalink / raw)
To: linux-raid

Ming Zhang <mingz@ele.uri.edu> writes:

> On Wed, 2005-07-13 at 23:58 -0400, Dan Christensen wrote:
>
>> But I still don't understand why the md layer can't do better.  If I
>> turn off readahead of the raw devices, and keep it for the raid
>> device, then parity blocks should never be requested, so they
>> shouldn't use any bus/controller bandwidth.  And even if each drive is
>> only acting at 75% efficiency, the four drives should still be able to
>> saturate the bus/controller.  So I can't figure out what's going on
>> here.
>
> when reading, i do not think MD will read the parity at all. but since
> the parity is spread over all the disks, there might be a seek here.

Yes, there will be a seek, or internal drive readahead, so each drive
will operate at around 75% efficiency.  But since that shouldn't
affect bus/controller traffic, I still would expect to get over
100MB/s with my hardware.

>> Also, is there a way to disable caching of reads?
>
> after you run your code, check /proc/meminfo; the cached value might be
> much lower than u expected. my feeling is that the linux page cache will
> discard all of its cache once the last file handle is closed.

Ming Zhang <mingz@ele.uri.edu> writes:

> my mistake here. this only applies to sdX, not mdX. pls ignore this.

I'm not sure what you mean.  For reads from sdX, mdX, files on sdX
or files on mdX, the cache is retained.  So it's necessary to clear
this cache to get valid timing results.

Dan

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: RAID-5 streaming read performance
2005-07-14 17:54 ` Dan Christensen
@ 2005-07-14 18:00 ` Ming Zhang
2005-07-14 18:03 ` Dan Christensen
0 siblings, 1 reply; 41+ messages in thread
From: Ming Zhang @ 2005-07-14 18:00 UTC (permalink / raw)
To: Dan Christensen; +Cc: Linux RAID

On Thu, 2005-07-14 at 13:54 -0400, Dan Christensen wrote:
> Ming Zhang <mingz@ele.uri.edu> writes:
>
> > On Wed, 2005-07-13 at 23:58 -0400, Dan Christensen wrote:
> >
> >> But I still don't understand why the md layer can't do better.  If I
> >> turn off readahead of the raw devices, and keep it for the raid
> >> device, then parity blocks should never be requested, so they
> >> shouldn't use any bus/controller bandwidth.  And even if each drive is
> >> only acting at 75% efficiency, the four drives should still be able to
> >> saturate the bus/controller.  So I can't figure out what's going on
> >> here.
> >
> > when reading, i do not think MD will read the parity at all. but since
> > the parity is spread over all the disks, there might be a seek here.
>
> Yes, there will be a seek, or internal drive readahead, so each drive
> will operate at around 75% efficiency.  But since that shouldn't
> affect bus/controller traffic, I still would expect to get over
> 100MB/s with my hardware.

agree. but what if your controller is a bottleneck? u need to have
another card to find out.

> >> Also, is there a way to disable caching of reads?
> >
> > after you run your code, check /proc/meminfo; the cached value might be
> > much lower than u expected. my feeling is that the linux page cache will
> > discard all of its cache once the last file handle is closed.
>
> Ming Zhang <mingz@ele.uri.edu> writes:
>
> > my mistake here. this only applies to sdX, not mdX. pls ignore this.
>
> I'm not sure what you mean.  For reads from sdX, mdX, files on sdX
> or files on mdX, the cache is retained.  So it's necessary to clear
> this cache to get valid timing results.

yes, i was insane at that time, pls ignore that blah blah.

> Dan
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: RAID-5 streaming read performance
2005-07-14 18:00 ` Ming Zhang
@ 2005-07-14 18:03 ` Dan Christensen
2005-07-14 18:10 ` Ming Zhang
0 siblings, 1 reply; 41+ messages in thread
From: Dan Christensen @ 2005-07-14 18:03 UTC (permalink / raw)
To: mingz; +Cc: Linux RAID

[Ming, could you trim quoted material down a bit more, and leave a
blank line between quoted material and your new text?  Thanks.]

Ming Zhang <mingz@ele.uri.edu> writes:

> On Thu, 2005-07-14 at 13:54 -0400, Dan Christensen wrote:
>>
>> Yes, there will be a seek, or internal drive readahead, so each drive
>> will operate at around 75% efficiency.  But since that shouldn't
>> affect bus/controller traffic, I still would expect to get over
>> 100MB/s with my hardware.
>
> agree. but what if your controller is a bottleneck? u need to have
> another card to find out.

The controller and/or bus *is* the bottleneck, but I've already shown
that I can get 106MB/s through them.

Dan

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: RAID-5 streaming read performance
2005-07-14 18:03 ` Dan Christensen
@ 2005-07-14 18:10 ` Ming Zhang
2005-07-14 19:16 ` Dan Christensen
0 siblings, 1 reply; 41+ messages in thread
From: Ming Zhang @ 2005-07-14 18:10 UTC (permalink / raw)
To: Dan Christensen; +Cc: Linux RAID

On Thu, 2005-07-14 at 14:03 -0400, Dan Christensen wrote:
> [Ming, could you trim quoted material down a bit more, and leave a
> blank line between quoted material and your new text?  Thanks.]

thanks. sorry about that.

> Ming Zhang <mingz@ele.uri.edu> writes:
>
> > On Thu, 2005-07-14 at 13:54 -0400, Dan Christensen wrote:
> >>
> > agree. but what if your controller is a bottleneck? u need to have
> > another card to find out.
>
> The controller and/or bus *is* the bottleneck, but I've already shown
> that I can get 106MB/s through them.
>
> Dan

then can u test RAID0 a bit? That is easier to analyze.

Ming

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: RAID-5 streaming read performance
2005-07-14 18:10 ` Ming Zhang
@ 2005-07-14 19:16 ` Dan Christensen
2005-07-14 20:13 ` Ming Zhang
0 siblings, 1 reply; 41+ messages in thread
From: Dan Christensen @ 2005-07-14 19:16 UTC (permalink / raw)
To: linux-raid

Ming Zhang <mingz@ele.uri.edu> writes:

> then can u test RAID0 a bit? That is easier to analyze.

I can't easily test RAID-0 with my set-up, but I can test RAID-1 with
two partitions.  I found that the read speed from the md device was
about the same as the read speed from each partition.  This was with
readahead set to 4096 on the md device, so I had hoped that it would
do better.  Based on the output of iostat, it looks like the reads
were shared roughly equally between the two partitions (53%/47%).

Does the RAID-1 code try to take the first stripe from disk 1, the
second from disk 2, alternately?  Or is it clever enough to try to
take the first dozen from disk 1, the next dozen from disk 2, etc,
in order to get larger, contiguous reads?

It's less clear to me that RAID-1 with two drives will be able to
overcome the overhead of skipping various blocks.  But it seems like
RAID-5 with four drives should be able to saturate my bus/controller.
For example, RAID-5 could just do sequential reads from 3 of the 4
drives, and use the parity chunks it reads to reconstruct the data
chunks from the fourth drive.  If I do parallel reads from 3 of my 4
disks, I can still get 106MB/s.

Dan

PS: Here's my simple test script, cleaned up a bit:

#!/bin/sh

# Devices to test for speed, and megabytes to read.
MDDEV=/dev/md2
MDMB=300
RAWDEVS="/dev/sda7 /dev/sdb5 /dev/sdc5 /dev/sdd5"
RAWMB=300

# Device to read to clear cache, and amount in megabytes.
CACHEDEV=/dev/sda8
CACHEMB=900

clearcache () {
    echo "Clearing cache..."
    dd if=$CACHEDEV of=/dev/null bs=1M count=$CACHEMB > /dev/null 2>&1
}

testdev () {
    echo "Read test from $1..."
    dd if=$1 of=/dev/null bs=1M count=$2 2>&1 | grep bytes/sec
    echo
}

clearcache
for f in $RAWDEVS ; do
    testdev $f $RAWMB
done

clearcache
testdev $MDDEV $MDMB

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: RAID-5 streaming read performance
2005-07-14 19:16 ` Dan Christensen
@ 2005-07-14 20:13 ` Ming Zhang
0 siblings, 0 replies; 41+ messages in thread
From: Ming Zhang @ 2005-07-14 20:13 UTC (permalink / raw)
To: Dan Christensen; +Cc: Linux RAID

raid5 can not be that smart. :P

ming

On Thu, 2005-07-14 at 15:16 -0400, Dan Christensen wrote:
> bus/controller.  For example, RAID-5 could just do sequential
> reads from 3 of the 4 drives, and use the parity chunks it
> reads to reconstruct the data chunks from the fourth drive.
> If I do parallel reads from 3 of my 4 disks, I can still get
> 106MB/s.
>

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: RAID-5 streaming read performance
2005-07-14  3:58 ` Dan Christensen
2005-07-14  4:13 ` Mark Hahn
2005-07-14 12:30 ` Ming Zhang
@ 2005-07-15  2:38 ` Dan Christensen
2005-07-15  6:01 ` Holger Kiehl
2 siblings, 1 reply; 41+ messages in thread
From: Dan Christensen @ 2005-07-15 2:38 UTC (permalink / raw)
To: linux-raid

Summary so far:

RAID-5, four SATA hard drives, 2.6.12.2 kernel.  Testing streaming
read speed.  With readahead optimized, I get:

each raw device:          58MB/s
raid device:              78MB/s
3 or 4 parallel reads
from the raw devices:    106MB/s

I'm trying to figure out why the last two numbers differ.

I was afraid that for some reason the kernel was requesting the parity
blocks instead of just the data blocks, but by using iostat it's
pretty clear that the right number of blocks are being requested from
the raw devices.  If I write a dumb program that reads 3 out of every
4 64k chunks of a raw device, the kernel readahead kicks in and the
chunks I skip over do contribute to the iostat numbers.  But the raid
layer is correctly avoiding this readahead.

One other theory at this point is that my controller is trying to be
clever and doing some readahead itself.  Even if this is the case, I'd
be surprised if this would cause a problem, since the data won't have
to go over the bus.  But maybe the controller is doing this and is
causing itself to become overloaded?  My controller is a Silicon Image
3114.  Details at the end, for the record.

Second theory: for contiguous streams from the raw devices, the reads
are done in really big chunks.  But for md layer reads, the biggest
possible chunk is 3 x 64k, if you want to skip parity blocks.  Could
3 x 64k be small enough to cause overhead?  Seems unlikely.

Those are my only guesses.  Any others?  It seems strange that I can
beat the md layer in userspace by 33%, by just reading from three of
the devices and using parity to reconstruct the fourth!

Thanks again for all the help.  I've learned a lot!  And I haven't
even started working on write speed...

Dan

0000:01:0b.0 RAID bus controller: Silicon Image, Inc. (formerly CMD Technology Inc) SiI 3114 [SATALink/SATARaid] Serial ATA Controller (rev 02)
	Subsystem: Silicon Image, Inc. (formerly CMD Technology Inc): Unknown device 6114
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
	Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
	Latency: 32, Cache Line Size: 0x08 (32 bytes)
	Interrupt: pin A routed to IRQ 177
	Region 0: I/O ports at 9400 [size=8]
	Region 1: I/O ports at 9800 [size=4]
	Region 2: I/O ports at 9c00 [size=8]
	Region 3: I/O ports at a000 [size=4]
	Region 4: I/O ports at a400 [size=16]
	Region 5: Memory at e1001000 (32-bit, non-prefetchable) [size=1K]
	Capabilities: [60] Power Management version 2
		Flags: PMEClk- DSI+ D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 PME-Enable- DSel=0 DScale=2 PME-

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: RAID-5 streaming read performance
2005-07-15  2:38 ` Dan Christensen
@ 2005-07-15  6:01 ` Holger Kiehl
2005-07-15 12:29 ` Ming Zhang
0 siblings, 1 reply; 41+ messages in thread
From: Holger Kiehl @ 2005-07-15 6:01 UTC (permalink / raw)
To: Dan Christensen; +Cc: linux-raid

Hello

On Thu, 14 Jul 2005, Dan Christensen wrote:

> Summary so far:
>
> RAID-5, four SATA hard drives, 2.6.12.2 kernel.  Testing streaming
> read speed.  With readahead optimized, I get:
>
> each raw device:          58MB/s
> raid device:              78MB/s
> 3 or 4 parallel reads
> from the raw devices:    106MB/s
>
> I'm trying to figure out why the last two numbers differ.
>
Have you checked what the performance with a 2.4.x kernel is?  If I
remember correctly there was some discussion on this list that 2.4
raid5 has better read performance.

Holger

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: RAID-5 streaming read performance
2005-07-15  6:01 ` Holger Kiehl
@ 2005-07-15 12:29 ` Ming Zhang
0 siblings, 0 replies; 41+ messages in thread
From: Ming Zhang @ 2005-07-15 12:29 UTC (permalink / raw)
To: Holger Kiehl; +Cc: Dan Christensen, Linux RAID

in my previous tests with SATA, i got better results with 2.6 than
with 2.4. :P

Ming

On Fri, 2005-07-15 at 06:01 +0000, Holger Kiehl wrote:
> > I'm trying to figure out why the last two numbers differ.
> >
> Have you checked what the performance with a 2.4.x kernel is?  If I
> remember correctly there was some discussion on this list that 2.4
> raid5 has better read performance.
>
> Holger
>

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: RAID-5 streaming read performance
2005-07-13 12:48 ` Dan Christensen
2005-07-13 12:52 ` Ming Zhang
@ 2005-07-13 22:42 ` Neil Brown
1 sibling, 0 replies; 41+ messages in thread
From: Neil Brown @ 2005-07-13 22:42 UTC (permalink / raw)
To: Dan Christensen; +Cc: mingz, Linux RAID

On Wednesday July 13, jdc@uwo.ca wrote:
> Question for the list: if I'm doing a long sequential write, naively
> each parity block will get recalculated and rewritten several times,
> once for each non-parity block in the stripe.  Does the write-caching
> that the kernel does mean that each parity block will only get written
> once?

Raid5 does the best it can.  It delays write requests as long as
possible, and then when it must do the write, it also writes all the
other blocks in the stripe that it has been asked to write, so only
one parity update is needed for all those blocks.

My tests suggest that for long sequential writes (without syncs) this
achieves full-stripe writes most of the time.

NeilBrown

^ permalink raw reply	[flat|nested] 41+ messages in thread
end of thread, other threads:[~2005-07-15 12:29 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-07-11 15:11 RAID-5 streaming read performance Dan Christensen
2005-07-13  2:08 ` Ming Zhang
2005-07-13  2:52 ` Dan Christensen
2005-07-13  3:15 ` berk walker
2005-07-13 12:24 ` Ming Zhang
2005-07-13 12:48 ` Dan Christensen
2005-07-13 12:52 ` Ming Zhang
2005-07-13 14:23 ` Dan Christensen
2005-07-13 14:29 ` Ming Zhang
2005-07-13 17:56 ` Dan Christensen
2005-07-13 22:38 ` Neil Brown
2005-07-14  0:09 ` Ming Zhang
2005-07-14  1:16 ` Neil Brown
2005-07-14  1:25 ` Ming Zhang
2005-07-13 18:02 ` David Greaves
2005-07-13 18:14 ` Ming Zhang
2005-07-13 21:18 ` David Greaves
2005-07-13 21:44 ` Ming Zhang
2005-07-13 21:50 ` David Greaves
2005-07-13 21:55 ` Ming Zhang
2005-07-13 22:52 ` Neil Brown
2005-07-14  3:58 ` Dan Christensen
2005-07-14  4:13 ` Mark Hahn
2005-07-14 21:16 ` Dan Christensen
2005-07-14 21:30 ` Ming Zhang
2005-07-14 23:29 ` Mark Hahn
2005-07-15  1:23 ` Ming Zhang
2005-07-15  2:11 ` Dan Christensen
2005-07-15 12:28 ` Ming Zhang
2005-07-14 12:30 ` Ming Zhang
2005-07-14 14:23 ` Ming Zhang
2005-07-14 17:54 ` Dan Christensen
2005-07-14 18:00 ` Ming Zhang
2005-07-14 18:03 ` Dan Christensen
2005-07-14 18:10 ` Ming Zhang
2005-07-14 19:16 ` Dan Christensen
2005-07-14 20:13 ` Ming Zhang
2005-07-15  2:38 ` Dan Christensen
2005-07-15  6:01 ` Holger Kiehl
2005-07-15 12:29 ` Ming Zhang
2005-07-13 22:42 ` Neil Brown