* raid5 that used parity for reads only when degraded
@ 2006-03-22 23:47 Alex Izvorski
  2006-03-23  0:13 ` Neil Brown
  0 siblings, 1 reply; 7+ messages in thread
From: Alex Izvorski @ 2006-03-22 23:47 UTC (permalink / raw)
  To: linux-raid

Hello,

I have a question: I'd like to have a raid5 array which writes parity data but
does not check it during reads while the array is ok.  I would trust each disk
to detect errors itself and cause the array to be degraded if necessary, in
which case that disk would drop out and the parity data would start being used
just as in a normal raid5.  In other words until there is an I/O error that
causes a disk to drop out, such an array would behave almost like a raid0 with
N-1 disks as far as reads are concerned.  Ideally this behavior would be
something that one could turn on/off on the fly with an ioctl or via an echo "0" >
/sys/block/md0/check_parity_on_reads type of mechanism.  

How hard is this to do?   Is anyone interested in helping to do this?  I think
it would really help applications which have a lot more reads than writes. 
Where exactly does parity checking during reads happen?  I've looked over the
code briefly but the right part of it didn't appear obvious ;)

Regards,
--Alex




* Re: raid5 that used parity for reads only when degraded
  2006-03-22 23:47 raid5 that used parity for reads only when degraded Alex Izvorski
@ 2006-03-23  0:13 ` Neil Brown
  2006-03-24  4:38   ` Alex Izvorski
  0 siblings, 1 reply; 7+ messages in thread
From: Neil Brown @ 2006-03-23  0:13 UTC (permalink / raw)
  To: Alex Izvorski; +Cc: linux-raid

On Wednesday March 22, aizvorski@gmail.com wrote:
> Hello,
> 
> I have a question: I'd like to have a raid5 array which writes parity data but
> does not check it during reads while the array is ok.  I would trust each disk
> to detect errors itself and cause the array to be degraded if necessary, in
> which case that disk would drop out and the parity data would start being used
> just as in a normal raid5.  In other words until there is an I/O error that
> causes a disk to drop out, such an array would behave almost like a raid0 with
> N-1 disks as far as reads are concerned.  Ideally this behavior would be
> something that one could turn on/off on the fly with an ioctl or via an echo "0" >
> /sys/block/md0/check_parity_on_reads type of mechanism.  
> 
> How hard is this to do?   Is anyone interested in helping to do this?  I think
> it would really help applications which have a lot more reads than writes. 
> Where exactly does parity checking during reads happen?  I've looked over the
> code briefly but the right part of it didn't appear obvious ;)

Parity checking does not happen during read.  You already have what
you want.

NeilBrown


* Re: raid5 that used parity for reads only when degraded
  2006-03-23  0:13 ` Neil Brown
@ 2006-03-24  4:38   ` Alex Izvorski
  2006-03-24  4:38     ` Neil Brown
  2006-03-24 17:19     ` raid5 that used parity for reads only when degraded dean gaudet
  0 siblings, 2 replies; 7+ messages in thread
From: Alex Izvorski @ 2006-03-24  4:38 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

Neil - Thank you very much for the response.  

In my tests with identically configured raid0 and raid5 arrays, raid5
initially had much lower throughput during reads.  I had assumed that
was because raid5 did parity-checking all the time.  It turns out that
raid5 throughput can get fairly close to raid0 throughput
if /sys/block/md0/md/stripe_cache_size is set to a very high value,
8192-16384.  However the cpu load is still very much higher during raid5
reads.  I'm not sure why?

My test setup consists of 8x WD4000RE 400GB SATA disks, a 2.4GHz
Athlon64X2 cpu and 2GB RAM, kernel 2.6.15 and mdadm 2.3.  I am using my
own simple test application which uses POSIX aio to do randomly
positioned block reads.  When doing 8MB block reads, 14 outstanding I/O
requests, from a 7-disk raid0 with 1MB chunk size I get 200MB/s
throughput and ~5% cpu load.  When running the same on an 8-disk raid5
with the same chunk size (which I'd expect to have identical
performance, as per what you describe as the behaviour of a non-degraded
raid5) with default stripe_cache_size of 256 I get a mere 60MB/s and a
cpu load of ~12%.  Increasing the stripe_cache_size to 8192 brings the
throughput to approximately 200MB/s or the same as for the raid0, but
the cpu load jumps to 45%.  Some other combinations of parameters, e.g.
32MB chunk size and 4MB reads with stripe_cache_size of 16384 result in
even more pathological cpu loads, over 80% (that is: 80% of both cpus!)
with throughput still at approx 200MB/s.  As a point of comparison the
same application reading directly from the raw disk devices with the
same settings achieves a total throughput of 300MB/s and a cpu load of
3%, so I am pretty sure the SATA controllers or drivers etc are not a
factor.  Also the cpu load is measured with Andrew Morton's cyclesoak
tool which I believe to be quite accurate.
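
For reference, here is a minimal sketch of the kind of POSIX aio random-read
test used above.  It is not the actual test tool; the device path, block
size, queue depth and total volume are placeholders, and it needs to be
linked with -lrt:

/* random_reads.c - keep QUEUE_DEPTH reads of BLOCK_SIZE bytes in flight
 * at random offsets on a block device, as in the test described above. */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE  (8 << 20)   /* 8MB reads */
#define QUEUE_DEPTH 14          /* outstanding requests */

int main(int argc, char **argv)
{
    const char *dev = argc > 1 ? argv[1] : "/dev/md0";   /* placeholder */
    int fd = open(dev, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    off_t nblocks = lseek(fd, 0, SEEK_END) / BLOCK_SIZE;
    struct aiocb cb[QUEUE_DEPTH];
    const struct aiocb *list[QUEUE_DEPTH];
    memset(cb, 0, sizeof(cb));

    /* Prime the queue with QUEUE_DEPTH reads at random block offsets. */
    for (int i = 0; i < QUEUE_DEPTH; i++) {
        cb[i].aio_fildes = fd;
        cb[i].aio_nbytes = BLOCK_SIZE;
        cb[i].aio_buf    = malloc(BLOCK_SIZE);
        cb[i].aio_offset = (off_t)(rand() % nblocks) * BLOCK_SIZE;
        aio_read(&cb[i]);
        list[i] = &cb[i];
    }

    long long total = 0;
    while (total < (8LL << 30)) {               /* stop after ~8GB read */
        aio_suspend(list, QUEUE_DEPTH, NULL);   /* wait for a completion */
        for (int i = 0; i < QUEUE_DEPTH; i++) {
            if (aio_error(&cb[i]) == EINPROGRESS)
                continue;
            ssize_t n = aio_return(&cb[i]);
            if (n <= 0) { fprintf(stderr, "read failed\n"); return 1; }
            total += n;
            /* Resubmit at a new random offset to keep the queue full. */
            cb[i].aio_offset = (off_t)(rand() % nblocks) * BLOCK_SIZE;
            aio_read(&cb[i]);
        }
    }
    printf("read %lld bytes\n", total);
    close(fd);
    return 0;
}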

Any thoughts on what could be causing the high cpu load?  I am very
interested in helping debug this since I really need a high-throughput
raid5 with reasonably low cpu requirements.  Please let me know if you
have any ideas or anything you'd like me to try (valgrind, perhaps?).
I'd be happy to give you more details on the test setup as well.

Sincerely,

--Alex

On Thu, 2006-03-23 at 11:13 +1100, Neil Brown wrote:
> On Wednesday March 22, aizvorski@gmail.com wrote:
> > Hello,
> > 
> > I have a question: I'd like to have a raid5 array which writes parity data but
> > does not check it during reads while the array is ok.  I would trust each disk
> > to detect errors itself and cause the array to be degraded if necessary, in
> > which case that disk would drop out and the parity data would start being used
> > just as in a normal raid5.  In other words until there is an I/O error that
> > causes a disk to drop out, such an array would behave almost like a raid0 with
> > N-1 disks as far as reads are concerned.  Ideally this behavior would be
> > something that one could turn on/off on the fly with an ioctl or via an echo "0" >
> > /sys/block/md0/check_parity_on_reads type of mechanism.  
> > 
> > How hard is this to do?   Is anyone interested in helping to do this?  I think
> > it would really help applications which have a lot more reads than writes. 
> > Where exactly does parity checking during reads happen?  I've looked over the
> > code briefly but the right part of it didn't appear obvious ;)
> 
> Parity checking does not happen during read.  You already have what
> you want.
> 
> NeilBrown







* Re: raid5 that used parity for reads only when degraded
  2006-03-24  4:38   ` Alex Izvorski
@ 2006-03-24  4:38     ` Neil Brown
  2006-03-24  9:02       ` raid5 high cpu usage during reads Alex Izvorski
  2006-03-24 17:19     ` raid5 that used parity for reads only when degraded dean gaudet
  1 sibling, 1 reply; 7+ messages in thread
From: Neil Brown @ 2006-03-24  4:38 UTC (permalink / raw)
  To: Alex Izvorski; +Cc: linux-raid

On Thursday March 23, aizvorski@gmail.com wrote:
> Neil - Thank you very much for the response.  
> 
> In my tests with identically configured raid0 and raid5 arrays, raid5
> initially had much lower throughput during reads.  I had assumed that
> was because raid5 did parity-checking all the time.  It turns out that
> raid5 throughput can get fairly close to raid0 throughput
> if /sys/block/md0/md/stripe_cache_size is set to a very high value,
> 8192-16384.  However the cpu load is still very much higher during raid5
> reads.  I'm not sure why?

Probably all the memcpys.
For a raid5 read, the data is DMAed from the device into the
stripe_cache, and then memcpy is used to move it to the filesystem (or
other client) buffer.  Worse: this memcpy happens on only one CPU so a
multiprocessor won't make it go any faster.

It would be possible to bypass the stripe_cache for reads from a
non-degraded array (I did it for 2.4) but it is somewhat more complex
in 2.6 and I haven't attempted it yet (there have always been other
more interesting things to do).

To test if this is the problem you could probably just comment out the
memcpy (the copy_data in handle_stripe) and see if the reads go
faster.  Obviously you will be getting garbage back, but it should
give you a reasonably realistic measure of the cost.
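
For a rough feel of that cost in isolation, a userspace model of the staged
copy (this is not the raid5 code - the input path and sizes below are
arbitrary) can time a read into a staging buffer plus a memcpy against a
direct read:

/* staged_copy.c - compare "read into a staging buffer, then memcpy to the
 * client buffer" against "read directly into the client buffer". */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define CHUNK (1 << 20)          /* 1MB per read */
#define TOTAL (1LL << 30)        /* move 1GB per pass */

static double timed_pass(int fd, int staged, char *stage, char *dest)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long long done = 0; done < TOTAL; done += CHUNK) {
        if (staged) {
            if (read(fd, stage, CHUNK) < 0) break;  /* "DMA" into the cache */
            memcpy(dest, stage, CHUNK);             /* extra copy to the client */
        } else {
            if (read(fd, dest, CHUNK) < 0) break;   /* direct into the client */
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/dev/zero";   /* arbitrary source */
    char *stage = malloc(CHUNK), *dest = malloc(CHUNK);
    int fd = open(path, O_RDONLY);
    if (fd < 0 || !stage || !dest) { perror("setup"); return 1; }

    printf("direct: %.2f s\n", timed_pass(fd, 0, stage, dest));
    lseek(fd, 0, SEEK_SET);
    printf("staged: %.2f s\n", timed_pass(fd, 1, stage, dest));
    close(fd);
    return 0;
}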

NeilBrown


* Re: raid5 high cpu usage during reads
  2006-03-24  4:38     ` Neil Brown
@ 2006-03-24  9:02       ` Alex Izvorski
  0 siblings, 0 replies; 7+ messages in thread
From: Alex Izvorski @ 2006-03-24  9:02 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

On Fri, 2006-03-24 at 15:38 +1100, Neil Brown wrote:
> On Thursday March 23, aizvorski@gmail.com wrote:
> > Neil - Thank you very much for the response.  
> > 
> > In my tests with identically configured raid0 and raid5 arrays, raid5
> > initially had much lower throughput during reads.  I had assumed that
> > was because raid5 did parity-checking all the time.  It turns out that
> > raid5 throughput can get fairly close to raid0 throughput
> > if /sys/block/md0/md/stripe_cache_size is set to a very high value,
> > 8192-16384.  However the cpu load is still very much higher during raid5
> > reads.  I'm not sure why?
> 
> Probably all the memcpys.
> For a raid5 read, the data is DMAed from the device into the
> stripe_cache, and then memcpy is used to move it to the filesystem (or
> other client) buffer.  Worse: this memcpy happens on only one CPU so a
> multiprocessor won't make it go any faster.
> 
> It would be possible to bypass the stripe_cache for reads from a
> non-degraded array (I did it for 2.4) but it is somewhat more complex
> in 2.6 and I haven't attempted it yet (there have always been other
> more interesting things to do).
> 
> To test if this is the problem you could probably just comment out the
> memcpy (the copy_data in handle_stripe) and see if the reads go
> faster.  Obviously you will be getting garbage back, but it should
> give you a reasonably realistic measure of the cost.
> 
> NeilBrown

Neil - Thank you again for the suggestion.  I did as you said and
commented out copy_data() and ran a number of tests with the modified
kernel.  The results are in a spreadsheet-importable format at the end
of this email (let me know if I should send them in some other way).  In
short, this gives a fairly consistent 20% reduction in CPU usage under
max throughput conditions, i.e. typically that accounts for just over
half the difference in CPU usage between raid0 and raid5, everything
else being equal.  By the way, on the same machine memcpy() benchmarks
at ~1GB/s, so if the data is being read at 200MB/s and copied once that
would be about 10% CPU load - perhaps the data actually gets copied
twice?  That would be consistent.
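
As a quick check of that arithmetic, here is a small standalone sketch
(arbitrary buffer sizes; it reports the share of a single CPU, so halve the
figure for the two-CPU totals that cyclesoak reports here):

/* memcpy_share.c - measure memcpy bandwidth and print the implied
 * single-CPU share of copying a 200MB/s read stream once and twice. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
    size_t len = 64 << 20;                       /* 64MB buffers */
    int iters = 16;                              /* copy 1GB in total */
    char *src = malloc(len), *dst = malloc(len);
    if (!src || !dst) { fprintf(stderr, "malloc failed\n"); return 1; }
    memset(src, 1, len);
    memset(dst, 0, len);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++)
        memcpy(dst, src, len);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double mbps = (double)len * iters / secs / 1e6;
    printf("memcpy bandwidth: %.0f MB/s\n", mbps);
    printf("one copy of a 200MB/s stream:   %.1f%% of one CPU\n", 200.0 / mbps * 100);
    printf("two copies of a 200MB/s stream: %.1f%% of one CPU\n", 400.0 / mbps * 100);
    free(src);
    free(dst);
    return 0;
}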

Anyway, it seems copy_data() is definitely part of the answer, but not
the whole answer.  In the case of 32MB stripes, something else uses up
to 60% of the CPU time.  Perhaps some kind of O(n^2) scalability issue
in the stripe cache data structures?  I'm not positive, but it seems the
hit outside copy_data() is particularly large in situations in which
stripe_cache_active returns large numbers.

How hard is it to bypass the stripe cache for reads?  I would certainly
lobby for you to work on that ;) since without it raid5 is only really
suitable for database-type workloads, not multimedia-type workloads
(again bearing in mind that a full-speed read by itself uses up an
entire high-end CPU or more - you can understand why I thought it was
calculating parity ;)  I'll do what I can to help, of course.

Let me know what other tests I can run.

Regards,
--Alex




"raid level"|"num disks"|"chunk size, kB"|"copy_data disabled"|"stripe
cache size"|"block read size, MB"|"num concurrent reads"|"throughput,
MB/s"|"cpu load, %"
raid5|8|64|N|8192|8|14|186|35
raid0|7|64|-|-|8|14|243|7
raid5|8|64|N|8192|256|1|215|38
raid0|7|64|-|-|256|1|272|7
raid5|8|256|Y|8192|8|14|201|17
raid5|8|256|N|8192|8|14|200|40
raid0|7|256|-|-|8|14|241|4
raid5|8|256|Y|8192|256|1|221|17
raid5|8|256|N|8192|256|1|218|40
raid0|7|256|-|-|256|1|260|6
raid5|8|1024|Y|8192|8|14|207|20
raid5|8|1024|N|8192|8|14|206|40
raid0|7|1024|-|-|8|14|243|5
raid5|8|32768|Y|16384|8|14|227|60
raid5|8|32768|N|16384|8|14|208|80
raid0|7|32768|-|-|8|14|244|15
raid5|8|32768|Y|16384|256|1|212|25
raid5|8|32768|N|16384|256|1|207|45
raid0|7|32768|-|-|256|1|217|10




* Re: raid5 that used parity for reads only when degraded
  2006-03-24  4:38   ` Alex Izvorski
  2006-03-24  4:38     ` Neil Brown
@ 2006-03-24 17:19     ` dean gaudet
  2006-03-24 23:16       ` Alex Izvorski
  1 sibling, 1 reply; 7+ messages in thread
From: dean gaudet @ 2006-03-24 17:19 UTC (permalink / raw)
  To: Alex Izvorski; +Cc: Neil Brown, linux-raid

On Thu, 23 Mar 2006, Alex Izvorski wrote:

> Also the cpu load is measured with Andrew Morton's cyclesoak
> tool which I believe to be quite accurate.

there's something cyclesoak does which i'm not sure i agree with: 
cyclesoak process dirties an array of 1000000 bytes... so what you're 
really getting is some sort of composite measurement of memory system 
utilisation and cpu cycle availability.

i think that 1MB number was chosen before 1MiB caches were common... and 
what you get during calibration is an L2 cache-hot loop, but i'm not sure 
that's an important number.

i'd look at what happens if you increase cyclesoak.c busyloop_size to 8MB 
... and decrease it to 128.  the two extremes are going to weight the "cpu 
load" towards measuring available memory system bandwidth and available 
cpu cycles.
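
as a rough model of what changing busyloop_size does (an assumption based on 
the description above, not cyclesoak's actual source), a soaker that dirties 
a configurable working set looks something like this:

/* soaker.c - dirty one cacheline per step across a working set whose size
 * is given on the command line; with 8MB the loop is bound by the memory
 * system, with 128 bytes it stays in L1 and burns pure cpu cycles. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    size_t busyloop_size = argc > 1 ? strtoul(argv[1], NULL, 0) : 1000000;
    volatile char *buf = malloc(busyloop_size);
    unsigned long loops = 0;

    if (!buf) { fprintf(stderr, "malloc failed\n"); return 1; }
    for (;;) {
        for (size_t i = 0; i < busyloop_size; i += 64)   /* one cacheline per step */
            buf[i]++;
        if (++loops % 100000 == 0)
            fprintf(stderr, "loops: %lu\n", loops);      /* crude progress marker */
    }
    return 0;
}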

also for calibration consider using a larger "-p n" ... especially if 
you've got any cpufreq/powernowd setup which is varying your clock 
rates... you want to be sure that it's calibrated (and measured) at a 
fixed clock rate.

-dean


* Re: raid5 that used parity for reads only when degraded
  2006-03-24 17:19     ` raid5 that used parity for reads only when degraded dean gaudet
@ 2006-03-24 23:16       ` Alex Izvorski
  0 siblings, 0 replies; 7+ messages in thread
From: Alex Izvorski @ 2006-03-24 23:16 UTC (permalink / raw)
  To: dean gaudet; +Cc: linux-raid

On Fri, 2006-03-24 at 09:19 -0800, dean gaudet wrote:
> On Thu, 23 Mar 2006, Alex Izvorski wrote:
> 
> > Also the cpu load is measured with Andrew Morton's cyclesoak
> > tool which I believe to be quite accurate.
> 
> there's something cyclesoak does which i'm not sure i agree with: 
> cyclesoak process dirties an array of 1000000 bytes... so what you're 
> really getting is some sort of composite measurement of memory system 
> utilisation and cpu cycle availability.
> 
> i think that 1MB number was chosen before 1MiB caches were common... and 
> what you get during calibration is an L2 cache-hot loop, but i'm not sure 
> that's an important number.
> 
> i'd look at what happens if you increase cyclesoak.c busyloop_size to 8MB 
> ... and decrease it to 128.  the two extremes are going to weight the "cpu 
> load" towards measuring available memory system bandwidth and available 
> cpu cycles.
> 
> also for calibration consider using a larger "-p n" ... especially if 
> you've got any cpufreq/powernowd setup which is varying your clock 
> rates... you want to be sure that it's calibrated (and measured) at a 
> fixed clock rate.
> 
> -dean

Dean - those are interesting ideas.  I tried them out, but they do not
appear to make much difference:  the measured load with busyloop_size of
128, 1M and 8M is the same within a couple of percent.  As far as I can
determine, busyloop spends most of its time in the "for (thumb = 0; thumb
< twiddle; thumb++)" loop, and only touches about 150MB of memory per
second (2.3M loops/sec, one cacheline or 64 bytes affected per loop).  I
don't have cpufreq so that's not a factor.  So far everything leads me
to believe that what cyclesoak reports is quite accurate.  I've even
confirmed it by timing other cpu-bound tasks (like compressing a file in
memory) and the results are essentially identical.

Regards,
--Alex



