* raid5 that used parity for reads only when degraded
From: Alex Izvorski @ 2006-03-22 23:47 UTC
To: linux-raid

Hello,

I have a question: I'd like to have a raid5 array which writes parity data
but does not check it during reads while the array is ok.  I would trust
each disk to detect errors itself and cause the array to be degraded if
necessary, in which case that disk would drop out and the parity data would
start being used just as in a normal raid5.  In other words, until there is
an I/O error that causes a disk to drop out, such an array would behave
almost like a raid0 with N-1 disks as far as reads are concerned.  Ideally
this behavior would be something that one could turn on/off on the fly with
an ioctl or via an
echo "0" > /sys/block/md0/check_parity_on_reads
type of mechanism.

How hard is this to do?  Is anyone interested in helping to do this?  I
think it would really help applications which have a lot more reads than
writes.  Where exactly does parity checking during reads happen?  I've
looked over the code briefly but the right part of it didn't appear
obvious ;)

Regards,
--Alex
* Re: raid5 that used parity for reads only when degraded
From: Neil Brown @ 2006-03-23 0:13 UTC
To: Alex Izvorski; Cc: linux-raid

On Wednesday March 22, aizvorski@gmail.com wrote:
> Hello,
>
> I have a question: I'd like to have a raid5 array which writes parity
> data but does not check it during reads while the array is ok.  I would
> trust each disk to detect errors itself and cause the array to be
> degraded if necessary, in which case that disk would drop out and the
> parity data would start being used just as in a normal raid5.  In other
> words, until there is an I/O error that causes a disk to drop out, such
> an array would behave almost like a raid0 with N-1 disks as far as reads
> are concerned.  Ideally this behavior would be something that one could
> turn on/off on the fly with an ioctl or via an
> echo "0" > /sys/block/md0/check_parity_on_reads
> type of mechanism.
>
> How hard is this to do?  Is anyone interested in helping to do this?  I
> think it would really help applications which have a lot more reads than
> writes.  Where exactly does parity checking during reads happen?  I've
> looked over the code briefly but the right part of it didn't appear
> obvious ;)

Parity checking does not happen during read.  You already have what
you want.

NeilBrown
* Re: raid5 that used parity for reads only when degraded
From: Alex Izvorski @ 2006-03-24 4:38 UTC
To: Neil Brown; Cc: linux-raid

Neil - Thank you very much for the response.

In my tests with identically configured raid0 and raid5 arrays, raid5
initially had much lower throughput during reads.  I had assumed that was
because raid5 did parity-checking all the time.  It turns out that raid5
throughput can get fairly close to raid0 throughput if
/sys/block/md0/md/stripe_cache_size is set to a very high value,
8192-16384.  However the cpu load is still very much higher during raid5
reads.  I'm not sure why?

My test setup consists of 8x WD4000RE 400GB SATA disks, a 2.4GHz Athlon64X2
cpu and 2GB RAM, kernel 2.6.15 and mdadm 2.3.  I am using my own simple
test application which uses POSIX aio to do randomly positioned block
reads.

When doing 8MB block reads, with 14 outstanding io requests, from a 7-disk
raid0 with 1MB chunk size, I get 200MB/s throughput and ~5% cpu load.  When
running the same on an 8-disk raid5 with the same chunk size (which I'd
expect to have identical performance, as per what you describe as the
behaviour of a non-degraded raid5) with the default stripe_cache_size of
256, I get a mere 60MB/s and a cpu load of ~12%.  Increasing the
stripe_cache_size to 8192 brings the throughput to approximately 200MB/s,
the same as for the raid0, but the cpu load jumps to 45%.  Some other
combinations of parameters, e.g. 32MB chunk size and 4MB reads with a
stripe_cache_size of 16384, result in even more pathological cpu loads,
over 80% (that is: 80% of both cpus!) with throughput still at
approximately 200MB/s.

As a point of comparison, the same application reading directly from the
raw disk devices with the same settings achieves a total throughput of
300MB/s and a cpu load of 3%, so I am pretty sure the SATA controllers,
drivers, etc. are not a factor.  Also, the cpu load is measured with Andrew
Morton's cyclesoak tool, which I believe to be quite accurate.

Any thoughts on what could be causing the high cpu load?  I am very
interested in helping debug this, since I really need a high-throughput
raid5 with reasonably low cpu requirements.  Please let me know if you have
any ideas or anything you'd like me to try (valgrind, perhaps?).  I'd be
happy to give you more details on the test setup as well.

Sincerely,
--Alex

On Thu, 2006-03-23 at 11:13 +1100, Neil Brown wrote:
> On Wednesday March 22, aizvorski@gmail.com wrote:
> > Hello,
> >
> > I have a question: I'd like to have a raid5 array which writes parity
> > data but does not check it during reads while the array is ok.  I
> > would trust each disk to detect errors itself and cause the array to
> > be degraded if necessary, in which case that disk would drop out and
> > the parity data would start being used just as in a normal raid5.  In
> > other words, until there is an I/O error that causes a disk to drop
> > out, such an array would behave almost like a raid0 with N-1 disks as
> > far as reads are concerned.  Ideally this behavior would be something
> > that one could turn on/off on the fly with an ioctl or via an
> > echo "0" > /sys/block/md0/check_parity_on_reads
> > type of mechanism.
> >
> > How hard is this to do?  Is anyone interested in helping to do this?
> > I think it would really help applications which have a lot more reads
> > than writes.  Where exactly does parity checking during reads happen?
> > I've looked over the code briefly but the right part of it didn't
> > appear obvious ;)
>
> Parity checking does not happen during read.  You already have what
> you want.
>
> NeilBrown
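The test program mentioned above was not posted.  A minimal sketch of a
harness of that shape - POSIX aio issuing randomly positioned,
block-aligned reads with a fixed number of requests outstanding - might
look like the following; the file name, device path, block size, queue
depth and total read volume are illustrative assumptions, not taken from
the original tool.  It builds with something like
gcc -O2 aioread.c -o aioread -lrt.

    /* aioread.c -- hypothetical sketch of a random-read throughput test.
     * /dev/md0, the 8MB block size, the queue depth of 14 and the 4GB
     * total are assumptions chosen to match the numbers in the thread. */
    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    #define BLOCK (8UL << 20)          /* 8MB per read            */
    #define DEPTH 14                   /* outstanding requests    */
    #define TOTAL (4ULL << 30)         /* stop after reading 4GB  */

    int main(int argc, char **argv)
    {
        const char *dev = argc > 1 ? argv[1] : "/dev/md0";
        int fd = open(dev, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        off_t nblocks = lseek(fd, 0, SEEK_END) / BLOCK;
        struct aiocb cb[DEPTH];
        const struct aiocb *pend[DEPTH];
        unsigned long long done = 0;
        struct timespec t0, t1;

        srand(1);
        clock_gettime(CLOCK_MONOTONIC, &t0);

        /* queue DEPTH reads at random block-aligned offsets */
        for (int i = 0; i < DEPTH; i++) {
            memset(&cb[i], 0, sizeof(cb[i]));
            cb[i].aio_fildes = fd;
            cb[i].aio_buf    = malloc(BLOCK);
            cb[i].aio_nbytes = BLOCK;
            cb[i].aio_offset = (off_t)(rand() % nblocks) * BLOCK;
            aio_read(&cb[i]);
            pend[i] = &cb[i];
        }

        while (done < TOTAL) {
            aio_suspend(pend, DEPTH, NULL);     /* wait for any completion */
            for (int i = 0; i < DEPTH; i++) {
                if (aio_error(&cb[i]) == EINPROGRESS)
                    continue;
                ssize_t n = aio_return(&cb[i]);
                if (n <= 0) { fprintf(stderr, "read error\n"); return 1; }
                done += n;
                /* resubmit at a new random position */
                cb[i].aio_offset = (off_t)(rand() % nblocks) * BLOCK;
                aio_read(&cb[i]);
            }
        }

        clock_gettime(CLOCK_MONOTONIC, &t1);
        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%s: %.0f MB/s\n", dev, done / secs / 1e6);
        return 0;
    }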
* Re: raid5 that used parity for reads only when degraded
From: Neil Brown @ 2006-03-24 4:38 UTC
To: Alex Izvorski; Cc: linux-raid

On Thursday March 23, aizvorski@gmail.com wrote:
> Neil - Thank you very much for the response.
>
> In my tests with identically configured raid0 and raid5 arrays, raid5
> initially had much lower throughput during reads.  I had assumed that
> was because raid5 did parity-checking all the time.  It turns out that
> raid5 throughput can get fairly close to raid0 throughput if
> /sys/block/md0/md/stripe_cache_size is set to a very high value,
> 8192-16384.  However the cpu load is still very much higher during raid5
> reads.  I'm not sure why?

Probably all the memcpys.

For a raid5 read, the data is DMAed from the device into the stripe_cache,
and then memcpy is used to move it to the filesystem (or other client)
buffer.  Worse: this memcpy happens on only one CPU, so a multiprocessor
won't make it go any faster.

It would be possible to bypass the stripe_cache for reads from a
non-degraded array (I did it for 2.4), but it is somewhat more complex in
2.6 and I haven't attempted it yet (there have always been other more
interesting things to do).

To test if this is the problem, you could probably just comment out the
memcpy (the copy_data in handle_stripe) and see if the reads go faster.
Obviously you will be getting garbage back, but it should give you a
reasonably realistic measure of the cost.

NeilBrown
* Re: raid5 high cpu usage during reads
From: Alex Izvorski @ 2006-03-24 9:02 UTC
To: Neil Brown; Cc: linux-raid

On Fri, 2006-03-24 at 15:38 +1100, Neil Brown wrote:
> On Thursday March 23, aizvorski@gmail.com wrote:
> > Neil - Thank you very much for the response.
> >
> > In my tests with identically configured raid0 and raid5 arrays, raid5
> > initially had much lower throughput during reads.  I had assumed that
> > was because raid5 did parity-checking all the time.  It turns out that
> > raid5 throughput can get fairly close to raid0 throughput if
> > /sys/block/md0/md/stripe_cache_size is set to a very high value,
> > 8192-16384.  However the cpu load is still very much higher during
> > raid5 reads.  I'm not sure why?
>
> Probably all the memcpys.
>
> For a raid5 read, the data is DMAed from the device into the
> stripe_cache, and then memcpy is used to move it to the filesystem (or
> other client) buffer.  Worse: this memcpy happens on only one CPU, so a
> multiprocessor won't make it go any faster.
>
> It would be possible to bypass the stripe_cache for reads from a
> non-degraded array (I did it for 2.4), but it is somewhat more complex
> in 2.6 and I haven't attempted it yet (there have always been other more
> interesting things to do).
>
> To test if this is the problem, you could probably just comment out the
> memcpy (the copy_data in handle_stripe) and see if the reads go faster.
> Obviously you will be getting garbage back, but it should give you a
> reasonably realistic measure of the cost.
>
> NeilBrown

Neil - Thank you again for the suggestion.  I did as you said, commented
out copy_data(), and ran a number of tests with the modified kernel.  The
results are in a spreadsheet-importable format at the end of this email
(let me know if I should send them in some other way).  In short, this
gives a fairly consistent 20% reduction in CPU usage under max throughput
conditions; typically that accounts for just over half the difference in
CPU usage between raid0 and raid5, everything else being equal.

By the way, on the same machine memcpy() benchmarks at ~1GB/s, so if the
data is being read at 200MB/s and copied once, that would be about 10% CPU
load - perhaps the data actually gets copied twice?  That would be
consistent.

Anyway, it seems copy_data() is definitely part of the answer, but not the
whole answer.  In the case of the 32MB chunk size, something else uses up
to 60% of the CPU time.  Perhaps some kind of O(n^2) scalability issue in
the stripe cache data structures?  I'm not positive, but it seems the hit
outside copy_data() is particularly large in situations in which
stripe_cache_active returns large numbers.

How hard is it to bypass the stripe cache for reads?  I would certainly
lobby for you to work on that ;) since without it raid5 is only really
suitable for database-type workloads, not multimedia-type workloads (again
bearing in mind that a full-speed read by itself uses up an entire
high-end CPU or more - you can understand why I thought it was calculating
parity ;)).  I'll do what I can to help, of course.  Let me know what
other tests I can run.
Regards,
--Alex

"raid level"|"num disks"|"chunk size, kB"|"copy_data disabled"|"stripe cache size"|"block read size, MB"|"num concurrent reads"|"throughput, MB/s"|"cpu load, %"
raid5|8|64|N|8192|8|14|186|35
raid0|7|64|-|-|8|14|243|7
raid5|8|64|N|8192|256|1|215|38
raid0|7|64|-|-|256|1|272|7
raid5|8|256|Y|8192|8|14|201|17
raid5|8|256|N|8192|8|14|200|40
raid0|7|256|-|-|8|14|241|4
raid5|8|256|Y|8192|256|1|221|17
raid5|8|256|N|8192|256|1|218|40
raid0|7|256|-|-|256|1|260|6
raid5|8|1024|Y|8192|8|14|207|20
raid5|8|1024|N|8192|8|14|206|40
raid0|7|1024|-|-|8|14|243|5
raid5|8|32768|Y|16384|8|14|227|60
raid5|8|32768|N|16384|8|14|208|80
raid0|7|32768|-|-|8|14|244|15
raid5|8|32768|Y|16384|256|1|212|25
raid5|8|32768|N|16384|256|1|207|45
raid0|7|32768|-|-|256|1|217|10
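The ~1GB/s memcpy figure quoted above is easy to sanity-check with a
micro-benchmark along these lines.  This is a sketch: the file name and the
64MB buffer size are assumptions, the buffers being chosen much larger than
the L2 cache so the copy is memory-bound, as a copy of freshly DMAed
stripe-cache data would be.  Build with something like
gcc -O2 memcpybw.c -o memcpybw -lrt.

    /* memcpybw.c -- rough memcpy bandwidth check.  The figure printed is
     * bytes copied per second; each byte copied is also read, so the real
     * memory traffic is roughly twice that. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define SZ   (64UL << 20)          /* 64MB source and destination */
    #define REPS 16

    int main(void)
    {
        char *src = malloc(SZ), *dst = malloc(SZ);
        if (!src || !dst) return 1;
        memset(src, 1, SZ);            /* fault the pages in first */
        memset(dst, 2, SZ);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < REPS; i++)
            memcpy(dst, src, SZ);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        /* printing a byte of dst keeps the copies from being optimised away */
        printf("memcpy: %.0f MB/s (last byte %d)\n",
               REPS * (SZ / 1e6) / secs, dst[SZ - 1]);
        return 0;
    }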
* Re: raid5 that used parity for reads only when degraded
From: dean gaudet @ 2006-03-24 17:19 UTC
To: Alex Izvorski; Cc: Neil Brown, linux-raid

On Thu, 23 Mar 2006, Alex Izvorski wrote:

> Also, the cpu load is measured with Andrew Morton's cyclesoak tool,
> which I believe to be quite accurate.

there's something cyclesoak does which i'm not sure i agree with: the
cyclesoak process dirties an array of 1000000 bytes... so what you're
really getting is some sort of composite measurement of memory system
utilisation and cpu cycle availability.

i think that 1MB number was chosen before 1MiB caches were common... and
what you get during calibration is an L2 cache-hot loop, but i'm not sure
that's an important number.

i'd look at what happens if you increase cyclesoak.c busyloop_size to 8MB
... and decrease it to 128.  the two extremes are going to weight the "cpu
load" towards measuring available memory system bandwidth and available
cpu cycles, respectively.

also for calibration consider using a larger "-p n" ... especially if
you've got any cpufreq/powernowd setup which is varying your clock
rates... you want to be sure that it's calibrated (and measured) at a
fixed clock rate.

-dean
* Re: raid5 that used parity for reads only when degraded
From: Alex Izvorski @ 2006-03-24 23:16 UTC
To: dean gaudet; Cc: linux-raid

On Fri, 2006-03-24 at 09:19 -0800, dean gaudet wrote:
> On Thu, 23 Mar 2006, Alex Izvorski wrote:
>
> > Also, the cpu load is measured with Andrew Morton's cyclesoak tool,
> > which I believe to be quite accurate.
>
> there's something cyclesoak does which i'm not sure i agree with: the
> cyclesoak process dirties an array of 1000000 bytes... so what you're
> really getting is some sort of composite measurement of memory system
> utilisation and cpu cycle availability.
>
> i think that 1MB number was chosen before 1MiB caches were common... and
> what you get during calibration is an L2 cache-hot loop, but i'm not
> sure that's an important number.
>
> i'd look at what happens if you increase cyclesoak.c busyloop_size to
> 8MB ... and decrease it to 128.  the two extremes are going to weight
> the "cpu load" towards measuring available memory system bandwidth and
> available cpu cycles, respectively.
>
> also for calibration consider using a larger "-p n" ... especially if
> you've got any cpufreq/powernowd setup which is varying your clock
> rates... you want to be sure that it's calibrated (and measured) at a
> fixed clock rate.
>
> -dean

Dean - those are interesting ideas.  I tried them out, but they do not
appear to make much difference: the measured load with a busyloop_size of
128, 1M and 8M is the same to within a couple of percent.  As far as I can
determine, busyloop spends most of its time in the
"for (thumb = 0; thumb < twiddle; thumb++)" loop, and only touches about
150MB of memory per second (2.3M loops/sec, with one cacheline or 64 bytes
affected per loop).  I don't have cpufreq, so that's not a factor.

So far everything leads me to believe that what cyclesoak reports is quite
accurate.  I've even confirmed it by timing other cpu-bound tasks (like
compressing a file in memory) and the results are essentially identical.

Regards,
--Alex
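A toy loop along the following lines (a sketch, not cyclesoak itself) makes
dean's point concrete: with a 128-byte working set it runs entirely out of
cache and its pass rate tracks spare cpu cycles, while with an 8MB working
set every sweep goes to main memory and the pass rate tracks spare memory
bandwidth.  The file name and the 5-second run time are arbitrary choices;
build with something like gcc -O2 soak.c -o soak -lrt and run it with 128
and with 8388608 as the argument.

    /* soak.c -- toy soaker loop (not cyclesoak): sweeps a working set of
     * the given size, touching one byte per 64-byte cache line, and
     * reports how fast it gets through it. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    int main(int argc, char **argv)
    {
        size_t size = argc > 1 ? strtoul(argv[1], NULL, 0) : 1000000;
        volatile char *buf = malloc(size);
        if (!buf) return 1;
        memset((void *)buf, 0, size);

        size_t batch = (8UL << 20) / size;  /* sweep ~8MB per timing check */
        if (batch < 1) batch = 1;

        struct timespec t0, t1;
        unsigned long long passes = 0;
        double secs = 0.0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        do {
            for (size_t b = 0; b < batch; b++) {
                for (size_t i = 0; i < size; i += 64)
                    buf[i]++;               /* dirty one byte per cache line */
                passes++;
            }
            clock_gettime(CLOCK_MONOTONIC, &t1);
            secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        } while (secs < 5.0);

        printf("%zu-byte working set: %.0f passes/s, ~%.0f MB/s swept\n",
               size, passes / secs, passes * (size / 1e6) / secs);
        return 0;
    }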
Thread overview: 7 messages
2006-03-22 23:47  raid5 that used parity for reads only when degraded -- Alex Izvorski
2006-03-23  0:13  ` Neil Brown
2006-03-24  4:38  ` Alex Izvorski
2006-03-24  4:38  ` Neil Brown
2006-03-24  9:02  ` raid5 high cpu usage during reads -- Alex Izvorski
2006-03-24 17:19  ` raid5 that used parity for reads only when degraded -- dean gaudet
2006-03-24 23:16  ` Alex Izvorski