* Re: RAID-6
       [not found] <Pine.GSO.4.30.0211111138080.15590-100000@multivac.sdsc.edu>
@ 2002-11-11 19:47 ` H. Peter Anvin
  0 siblings, 0 replies; 20+ messages in thread
From: H. Peter Anvin @ 2002-11-11 19:47 UTC (permalink / raw)
  To: Peter L. Ashford; +Cc: linux-raid

Peter L. Ashford wrote:
>
>> I'm playing around with RAID-6 algorithms lately. With RAID-6 I mean
>> a setup which needs N+2 disks for N disks worth of storage and can
>> handle any two disks failing -- this seems to be the contemporary
>> definition of RAID-6 (the originally proposed "two-dimensional parity"
>> which required N+2*sqrt(N) drives never took off for obvious reasons.)
>
> This appears to be the same as RAID-2.  Is there a web page that gives a
> more complete description?

http://www.acnc.com/04_01_06.html is a pretty good high-level
description, although it incorrectly states this is two-dimensional
parity, which it is *NOT* -- it's a Reed-Solomon syndrome.  The
distinction is critical in keeping the overhead down to 2 disks instead
of 2*sqrt(N) disks.

RAID-2 uses a Hamming code, according to the same web page, which has
the property that it will correct the data *even if you can't tell
which disks have failed*, whereas RAID-3 and higher all rely on
"erasure information", i.e. independent means to know which disks have
failed.  In practice this information is furnished by some kind of CRC
or other integrity check provided by the disk controller, or by the
disappearance of said controller.

    -hpa

^ permalink raw reply [flat|nested] 20+ messages in thread
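For reference, the P+Q ("Reed-Solomon syndrome") scheme described above can be
sketched as follows; the notation is added here for illustration and is not
part of the original mail.  Over GF(2^8) with generator g = 2, the two
syndromes computed from data blocks D_0 ... D_{n-1} are

    \[ P = \bigoplus_{i=0}^{n-1} D_i, \qquad
       Q = \bigoplus_{i=0}^{n-1} g^{i} D_i \quad \text{in } \mathrm{GF}(2^8),\ g = 2. \]

Because P and Q are independent, any two known-failed ("erased") devices can
be reconstructed: two lost data disks give two equations in two unknowns, a
lost data disk plus P is recovered through Q, and a lost P or Q is simply
recomputed from the surviving data.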
* Raid-6 Rebuild question
@ 2005-11-13 9:05 Brad Campbell
2005-11-13 10:05 ` Neil Brown
0 siblings, 1 reply; 20+ messages in thread
From: Brad Campbell @ 2005-11-13 9:05 UTC (permalink / raw)
To: RAID Linux
G'day all,
Here is an interesting question (well, I think so in any case). I just replaced a failed disk in my
15 drive Raid-6.
Simply mdadm --add /dev/md0 /dev/sdl
Why, when there is no other activity on the array at all, is it writing to every disk during the
recovery? I would have assumed it would just read from the others and write to sdl.
This is iostat -k 5 output from that machine while rebuilding:
avg-cpu: %user %nice %sys %iowait %idle
0.00 0.00 100.00 0.00 0.00
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 121.08 14187.95 925.30 23552 1536
sdb 127.71 14187.95 1002.41 23552 1664
sdc 125.30 14187.95 1002.41 23552 1664
sdd 122.29 14187.95 1002.41 23552 1664
sde 125.30 14187.95 1002.41 23552 1664
sdf 127.71 14187.95 1002.41 23552 1664
sdg 125.90 14187.95 925.30 23552 1536
sdh 125.30 14187.95 925.30 23552 1536
sdi 134.34 14187.95 925.30 23552 1536
sdj 137.95 14187.95 925.30 23552 1536
sdk 140.36 14187.95 1850.60 23552 3072
sdl 79.52 0.00 14265.06 0 23680
sdm 133.13 14187.95 925.30 23552 1536
sdn 134.34 14187.95 925.30 23552 1536
sdo 133.73 14187.95 925.30 23552 1536
md0 0.00 0.00 0.00 0 0
storage1:/home/brad# cat /proc/mdstat
Personalities : [raid6]
md0 : active raid6 sdl[15] sdg[6] sda[0] sdo[14] sdn[13] sdm[12] sdk[10] sdj[9] sdi[8] sdh[7] sdf[5]
sde[4] sdd[3] sdc[2] sdb[1]
3186525056 blocks level 6, 128k chunk, algorithm 2 [15/14] [UUUUUUUUUUU_UUU]
[>....................] recovery = 1.8% (4518144/245117312) finish=838.3min speed=4782K/sec
unused devices: <none>
Regards,
Brad
--
"Human beings, who are almost unique in having the ability
to learn from the experience of others, are also remarkable
for their apparent disinclination to do so." -- Douglas Adams
^ permalink raw reply [flat|nested] 20+ messages in thread

* Re: Raid-6 Rebuild question
  2005-11-13  9:05 Raid-6 Rebuild question Brad Campbell
@ 2005-11-13 10:05 ` Neil Brown
  2005-11-16 17:54   ` RAID-6 Bill Davidsen
  0 siblings, 1 reply; 20+ messages in thread
From: Neil Brown @ 2005-11-13 10:05 UTC (permalink / raw)
  To: Brad Campbell; +Cc: RAID Linux

On Sunday November 13, brad@wasp.net.au wrote:
> G'day all,
>
> Here is an interesting question (well, I think so in any case). I just
> replaced a failed disk in my 15 drive Raid-6.
>
> Simply mdadm --add /dev/md0 /dev/sdl
>
> Why, when there is no other activity on the array at all, is it writing
> to every disk during the recovery? I would have assumed it would just
> read from the others and write to sdl.

The raid6 recovery code always writes out the P and Q blocks for every
stripe.  This is unnecessary, and there is in fact a comment in the code
saying:

  /**** FIX: Should we really do both of these unconditionally? ****/

I recently reviewed and cleaned up this code, though I haven't tested
the new version yet.  I'll make sure the new code doesn't do unnecessary
writes (it may already not).

So there is a good chance that 2.6.16 will do a better job here.

Thanks for the report,
NeilBrown

^ permalink raw reply [flat|nested] 20+ messages in thread
* RAID-6
  2005-11-13 10:05 ` Neil Brown
@ 2005-11-16 17:54   ` Bill Davidsen
  2005-11-16 20:39     ` RAID-6 Dan Stromberg
  0 siblings, 1 reply; 20+ messages in thread
From: Bill Davidsen @ 2005-11-16 17:54 UTC (permalink / raw)
  To: Neil Brown; +Cc: RAID Linux

Based on some Google searching on RAID-6, I find that it seems to be
used to describe two different things.  One is very similar to RAID-5,
but with two redundancy blocks per stripe, one XOR and one CRC (or at
any rate two methods are employed).  Other sources define RAID-6 as
RAID-5 with a distributed hot spare, AKA RAID-5E, which spreads head
motion across all drives for performance.

Any clarification on this?

--
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: RAID-6
  2005-11-16 17:54 ` RAID-6 Bill Davidsen
@ 2005-11-16 20:39   ` Dan Stromberg
  2005-12-29 18:29     ` RAID-6 H. Peter Anvin
  0 siblings, 1 reply; 20+ messages in thread
From: Dan Stromberg @ 2005-11-16 20:39 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: Neil Brown, RAID Linux, strombrg

My understanding is that RAID 5 -always- stripes parity.  If it didn't,
I believe it would be RAID 4.

You may find http://linux.cudeso.be/raid.php of interest.

I don't think RAID level 6 was in the original RAID paper, so vendors
may have decided on their own that it should mean what they're
selling. :)

On Wed, 2005-11-16 at 12:54 -0500, Bill Davidsen wrote:
> Based on some Google searching on RAID-6, I find that it seems to be
> used to describe two different things.  One is very similar to RAID-5,
> but with two redundancy blocks per stripe, one XOR and one CRC (or at
> any rate two methods are employed).  Other sources define RAID-6 as
> RAID-5 with a distributed hot spare, AKA RAID-5E, which spreads head
> motion across all drives for performance.
>
> Any clarification on this?

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: RAID-6
  2005-11-16 20:39 ` RAID-6 Dan Stromberg
@ 2005-12-29 18:29   ` H. Peter Anvin
  0 siblings, 0 replies; 20+ messages in thread
From: H. Peter Anvin @ 2005-12-29 18:29 UTC (permalink / raw)
  To: linux-raid

Followup to:  <1132173592.23464.459.camel@seki.nac.uci.edu>
By author:    Dan Stromberg <strombrg@dcs.nac.uci.edu>
In newsgroup: linux.dev.raid
>
> My understanding is that RAID 5 -always- stripes parity.  If it didn't,
> I believe it would be RAID 4.
>
> You may find http://linux.cudeso.be/raid.php of interest.
>
> I don't think RAID level 6 was in the original RAID paper, so vendors
> may have decided on their own that it should mean what they're
> selling. :)

RAID-6 wasn't in the original RAID paper, but the term RAID-6 with the
P+Q parity definition is by far the dominant use of the term, and I
believe it is/was recognized by the RAID Advisory Board, which is as
close as you can get to an official statement.  The RAB seems to have
become defunct, with a standard squatter page on its previous web
address.

    -hpa

^ permalink raw reply [flat|nested] 20+ messages in thread
* RAID-6
@ 2002-11-11 18:52 H. Peter Anvin
2002-11-11 21:06 ` RAID-6 Derek Vadala
` (2 more replies)
0 siblings, 3 replies; 20+ messages in thread
From: H. Peter Anvin @ 2002-11-11 18:52 UTC (permalink / raw)
To: linux-raid
Hi all,
I'm playing around with RAID-6 algorithms lately. With RAID-6 I mean
a setup which needs N+2 disks for N disks worth of storage and can
handle any two disks failing -- this seems to be the contemporary
definition of RAID-6 (the originally proposed "two-dimensional parity"
which required N+2*sqrt(N) drives never took off for obvious reasons.)
Based on my current research, I think the following should be true:
a) write performance will be worse than RAID-5, but I believe it can
be kept to within a factor of 1.5-2.0 on machines with suitable
SIMD instruction sets (e.g. MMX or SSE-2);
b) read performance in normal and single failure degraded mode will be
comparable to RAID-5;
c) read performance in dual failure degraded mode will be quite bad.
I'm curious how much interest there would be in this, since I
certainly have enough projects without it, and I'm probably going to
need some of Neil's time to integrate it into the md driver and the
tools.
-hpa
--
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt <amsp@zytor.com>
^ permalink raw reply [flat|nested] 20+ messages in thread

* Re: RAID-6
  2002-11-11 18:52 RAID-6 H. Peter Anvin
@ 2002-11-11 21:06 ` RAID-6 Derek Vadala
  2002-11-11 22:44 ` RAID-6 Mr. James W. Laferriere
  2002-11-12 16:22 ` RAID-6 Jakob Oestergaard
  2 siblings, 0 replies; 20+ messages in thread
From: Derek Vadala @ 2002-11-11 21:06 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linux-raid

On 11 Nov 2002, H. Peter Anvin wrote:

> I'm curious how much interest there would be in this, since I
> certainly have enough projects without it, and I'm probably going to
> need some of Neil's time to integrate it into the md driver and the
> tools.

There was quite a long thread about this last June (it starts here:
http://marc.theaimsgroup.com/?l=linux-raid&m=102305890732421&w=2).

I've seen the lack of RAID-6 support cited as one of the shortcomings
of Linux SW RAID quite a few times, by quite a few sources.  That seems
like one reason to implement it.

--
Derek Vadala, derek@cynicism.com, http://www.cynicism.com/~derek

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: RAID-6
  2002-11-11 18:52 RAID-6 H. Peter Anvin
  2002-11-11 21:06 ` RAID-6 Derek Vadala
@ 2002-11-11 22:44 ` Mr. James W. Laferriere
  2002-11-11 23:05   ` RAID-6 H. Peter Anvin
  2002-11-12 16:22 ` RAID-6 Jakob Oestergaard
  2 siblings, 1 reply; 20+ messages in thread
From: Mr. James W. Laferriere @ 2002-11-11 22:44 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linux-raid

Hello Peter,

On 11 Nov 2002, H. Peter Anvin wrote:

> Hi all,
> I'm playing around with RAID-6 algorithms lately. With RAID-6 I mean
> a setup which needs N+2 disks for N disks worth of storage and can
> handle any two disks failing -- this seems to be the contemporary
> definition of RAID-6 (the originally proposed "two-dimensional parity"
> which required N+2*sqrt(N) drives never took off for obvious reasons.)

Was there a discussion of the 'two-dimensional parity' scheme on the
list?  I don't remember any (of course).  But other than 98+2+10, what
was the main difficulty?  I don't (personally) see any difficulty
(other than manageability/power/space) with the number of disks
required.  Tia, JimL

--
+------------------------------------------------------------------+
| James W. Laferriere    | System Techniques    | Give me VMS      |
| Network Engineer       | P.O. Box 854         |  Give me Linux   |
| babydr@baby-dragons.com | Coudersport PA 16915 |  only on AXP    |
+------------------------------------------------------------------+

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: RAID-6
  2002-11-11 22:44 ` RAID-6 Mr. James W. Laferriere
@ 2002-11-11 23:05   ` H. Peter Anvin
  0 siblings, 0 replies; 20+ messages in thread
From: H. Peter Anvin @ 2002-11-11 23:05 UTC (permalink / raw)
  To: Mr. James W. Laferriere; +Cc: linux-raid

Mr. James W. Laferriere wrote:
> Hello Peter,
>
> On 11 Nov 2002, H. Peter Anvin wrote:
>
>> Hi all,
>> I'm playing around with RAID-6 algorithms lately. With RAID-6 I mean
>> a setup which needs N+2 disks for N disks worth of storage and can
>> handle any two disks failing -- this seems to be the contemporary
>> definition of RAID-6 (the originally proposed "two-dimensional parity"
>> which required N+2*sqrt(N) drives never took off for obvious reasons.)
>
> Was there a discussion of the 'two-dimensional parity' scheme on the
> list?  I don't remember any (of course).  But other than 98+2+10, what
> was the main difficulty?  I don't (personally) see any difficulty
> (other than manageability/power/space) with the number of disks
> required.  Tia, JimL

No discussion of two-dimensional parity, but that was the originally
proposed RAID-6.  No one ever productized a solution like that, to the
best of my knowledge.

I don't know what you mean by "98+2+10", but the basic problem is that
with 2D parity, for N data drives you need 2*sqrt(N) redundancy drives,
which for any moderate-sized RAID is a lot (with 9 data drives you need
6 redundancy drives, so you have 67% overhead.)  You also have the same
kind of performance problems as RAID-4 does, because the "rotating
parity" trick of RAID-5 does not work in two dimensions.  And for all
of this, you're not *guaranteed* more than dual failure recovery
(although you might, probabilistically, luck out and have more than
that.)

P+Q redundancy, the current meaning of RAID-6, instead uses two
orthogonal redundancy functions, so you only need two redundancy drives
regardless of how much data you have, and you can apply the RAID-5
trick of rotating the parity around.  So from your 15 drives in the
example above, you get 13 drives worth of data instead of 9.

    -hpa

^ permalink raw reply [flat|nested] 20+ messages in thread
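To spell out the drive-count arithmetic above (assuming, for the 2D-parity
case, the usual square arrangement of k x k data disks -- an assumption added
here, not stated in the mail):

    \[ \text{2D parity: } N = k^2 \Rightarrow 2k = 2\sqrt{N} \text{ redundancy drives};
       \qquad \text{P+Q: always } 2. \]
    \[ N = 9:\quad \tfrac{6}{9} \approx 67\% \text{ overhead (15 drives total)}
       \quad\text{vs.}\quad \tfrac{2}{9} \approx 22\% \text{ (11 drives total)}. \]

Turned around, a fixed budget of 15 drives yields 9 data drives under 2D
parity but 13 under P+Q, which is the comparison made above.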
* Re: RAID-6
  2002-11-11 18:52 RAID-6 H. Peter Anvin
  2002-11-11 21:06 ` RAID-6 Derek Vadala
  2002-11-11 22:44 ` RAID-6 Mr. James W. Laferriere
@ 2002-11-12 16:22 ` RAID-6 Jakob Oestergaard
  2002-11-12 16:30   ` RAID-6 H. Peter Anvin
  2002-11-12 19:37   ` RAID-6 Neil Brown
  2 siblings, 2 replies; 20+ messages in thread
From: Jakob Oestergaard @ 2002-11-12 16:22 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linux-raid

On Mon, Nov 11, 2002 at 10:52:36AM -0800, H. Peter Anvin wrote:
> Hi all,
>
> I'm playing around with RAID-6 algorithms lately. With RAID-6 I mean
> a setup which needs N+2 disks for N disks worth of storage and can
> handle any two disks failing -- this seems to be the contemporary
> definition of RAID-6 (the originally proposed "two-dimensional parity"
> which required N+2*sqrt(N) drives never took off for obvious reasons.)
>
> Based on my current research, I think the following should be true:
>
> a) write performance will be worse than RAID-5, but I believe it can
>    be kept to within a factor of 1.5-2.0 on machines with suitable
>    SIMD instruction sets (e.g. MMX or SSE-2);

Please note that raw CPU power is usually *not* a limiting (or even
significantly contributing) factor on modern systems.

Limitations are disk reads/writes/seeks, bus bandwidth, etc.

You will probably cause more bus activity with RAID-6, and that might
degrade performance.  But I don't think you need to worry about
MMX/SSE/...  If you can do as well as the current RAID-5 code, then you
will be in the clear until people have 1GB/sec disk transfer rates on
500MHz PIII systems ;)

> b) read performance in normal and single failure degraded mode will be
>    comparable to RAID-5;

Which again is like a RAID-0 with some extra seeks -- e.g. not too bad
with huge chunk sizes.

You might want to consider using huge chunk sizes when reading, but
making sure that writes can be made on "sub-chunks" -- so that one could
run a RAID-6 with a 128k chunk size, yet have writes performed on 4k
chunks.  This is important for performance on both read and write, but
it is an optimization the current RAID-5 code lacks.

> c) read performance in dual failure degraded mode will be quite bad.
>
> I'm curious how much interest there would be in this, since I
> certainly have enough projects without it, and I'm probably going to
> need some of Neil's time to integrate it into the md driver and the
> tools.

I've seen quite a few people ask for it.  You might find a friend in
"Roy Sigurd Karlsbach" -- he for one has been asking (loudly) for it ;)

Go Peter!  ;)

--
................................................................
:   jakob@unthought.net   : And I see the elder races,         :
:.........................: putrid forms of man                :
:   Jakob Østergaard      : See him rise and claim the earth,  :
:        OZ9ABN           : his downfall is at hand.           :
:.........................:............{Konkhra}...............:

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: RAID-6
  2002-11-12 16:22 ` RAID-6 Jakob Oestergaard
@ 2002-11-12 16:30   ` RAID-6 H. Peter Anvin
  2002-11-12 19:01     ` RAID-6 H. Peter Anvin
  1 sibling, 1 reply; 20+ messages in thread
From: H. Peter Anvin @ 2002-11-12 16:30 UTC (permalink / raw)
  To: Jakob Oestergaard; +Cc: linux-raid

Jakob Oestergaard wrote:
>>
>> a) write performance will be worse than RAID-5, but I believe it can
>>    be kept to within a factor of 1.5-2.0 on machines with suitable
>>    SIMD instruction sets (e.g. MMX or SSE-2);
>
> Please note that raw CPU power is usually *not* a limiting (or even
> significantly contributing) factor on modern systems.
>
> Limitations are disk reads/writes/seeks, bus bandwidth, etc.
>
> You will probably cause more bus activity with RAID-6, and that might
> degrade performance.  But I don't think you need to worry about
> MMX/SSE/...  If you can do as well as the current RAID-5 code, then you
> will be in the clear until people have 1GB/sec disk transfer rates on
> 500MHz PIII systems ;)

RAID-6 will, obviously, never do as well as RAID-5 -- you are doing
more work (both computational and data-pushing.)

The RAID-6 syndrome computation is actually extremely expensive if you
can't do it in parallel.  Fortunately there is a way to do it in
parallel using MMX or SSE-2, although it seems to exist by pure dumb
luck -- certainly not by design.  I've tried to figure out how to
generalize to using regular 32-bit or 64-bit integer registers, but it
doesn't seem to work there.  Again, my initial analysis seems to
indicate performance within about a factor of 2.

>> b) read performance in normal and single failure degraded mode will be
>>    comparable to RAID-5;
>
> Which again is like a RAID-0 with some extra seeks -- e.g. not too bad
> with huge chunk sizes.
>
> You might want to consider using huge chunk sizes when reading, but
> making sure that writes can be made on "sub-chunks" -- so that one could
> run a RAID-6 with a 128k chunk size, yet have writes performed on 4k
> chunks.  This is important for performance on both read and write, but
> it is an optimization the current RAID-5 code lacks.

That's an issue for the common framework; I'll leave that to Neil.
It's functionally equivalent between RAID-5 and -6.

>> c) read performance in dual failure degraded mode will be quite bad.
>>
>> I'm curious how much interest there would be in this, since I
>> certainly have enough projects without it, and I'm probably going to
>> need some of Neil's time to integrate it into the md driver and the
>> tools.
>
> I've seen quite a few people ask for it.  You might find a friend in
> "Roy Sigurd Karlsbach" -- he for one has been asking (loudly) for it ;)

:)

Enough people have responded that I think I have a project...

    -hpa

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: RAID-6
  2002-11-12 16:30 ` RAID-6 H. Peter Anvin
@ 2002-11-12 19:01   ` H. Peter Anvin
  0 siblings, 0 replies; 20+ messages in thread
From: H. Peter Anvin @ 2002-11-12 19:01 UTC (permalink / raw)
  To: linux-raid

Followup to:  <3DD12CA3.5090105@zytor.com>
By author:    "H. Peter Anvin" <hpa@zytor.com>
In newsgroup: linux.dev.raid
>
> I've tried to figure out how to generalize to using regular 32-bit
> or 64-bit integer registers, but it doesn't seem to work there.

Of course, once I had said that I just had to go and figure it out :)
This is good, because it's inherently the highest performance we can
get out of portable code.

    -hpa
--
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt   <amsp@zytor.com>

^ permalink raw reply [flat|nested] 20+ messages in thread
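As a rough illustration of what such a portable, integer-register formulation
can look like (this is a sketch added here, not hpa's code, and the helper
names are invented): each 64-bit word holds eight GF(2^8) byte lanes,
multiplication by the generator g = 2 is done with shifts and masks, and the
Q syndrome is accumulated with Horner's rule.

    #include <stdint.h>
    #include <stddef.h>

    /* Multiply each of the eight bytes packed in v by 2 in GF(2^8)
     * (polynomial 0x11d), using only ordinary integer operations. */
    static uint64_t gf256_mul2(uint64_t v)
    {
        uint64_t hi   = v & 0x8080808080808080ULL;    /* high bit of each byte */
        uint64_t mask = (hi << 1) - (hi >> 7);        /* 0xff where it was set */

        return ((v << 1) & 0xfefefefefefefefeULL) ^ (mask & 0x1d1d1d1d1d1d1d1dULL);
    }

    /* Compute the P (XOR) and Q (Reed-Solomon) syndromes over `disks`
     * data buffers of `bytes` bytes each.  Assumes bytes % 8 == 0 and
     * 8-byte-aligned buffers. */
    static void gen_pq(int disks, size_t bytes, uint8_t **dptr,
                       uint8_t *p, uint8_t *q)
    {
        for (size_t off = 0; off < bytes; off += 8) {
            uint64_t wp = 0, wq = 0;

            for (int d = disks - 1; d >= 0; d--) {    /* Horner's rule */
                uint64_t wd = *(uint64_t *)(dptr[d] + off);

                wp ^= wd;                             /* P = sum of D_d       */
                wq  = gf256_mul2(wq) ^ wd;            /* Q = sum of 2^d * D_d */
            }
            *(uint64_t *)(p + off) = wp;
            *(uint64_t *)(q + off) = wq;
        }
    }

The same per-byte trick maps directly onto MMX/SSE-2 registers, just 8 or 16
bytes at a time, which is the parallel formulation discussed earlier in the
thread.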
* Re: RAID-6
  2002-11-12 16:22 ` RAID-6 Jakob Oestergaard
  2002-11-12 16:30   ` RAID-6 H. Peter Anvin
@ 2002-11-12 19:37   ` RAID-6 Neil Brown
  2002-11-13  2:13     ` RAID-6 Jakob Oestergaard
  1 sibling, 1 reply; 20+ messages in thread
From: Neil Brown @ 2002-11-12 19:37 UTC (permalink / raw)
  To: Jakob Oestergaard; +Cc: H. Peter Anvin, linux-raid

On Tuesday November 12, jakob@unthought.net wrote:
>
> You might want to consider using huge chunk sizes when reading, but
> making sure that writes can be made on "sub-chunks" -- so that one could
> run a RAID-6 with a 128k chunk size, yet have writes performed on 4k
> chunks.  This is important for performance on both read and write, but
> it is an optimization the current RAID-5 code lacks.

Either I misunderstand your point, or you misunderstand the code.

A 4k write request will cause a 4k write to a data block and a 4k
write to a parity block, no matter what the chunk size is.  (There may
also be pre-reading, and possibly several 4k writes will share a
parity block update.)

I see no lacking optimisation, but if you do, I would be keen to hear
a more detailed explanation.

NeilBrown

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: RAID-6
  2002-11-12 19:37 ` RAID-6 Neil Brown
@ 2002-11-13  2:13   ` RAID-6 Jakob Oestergaard
  2002-11-13  3:33     ` RAID-6 Neil Brown
  0 siblings, 1 reply; 20+ messages in thread
From: Jakob Oestergaard @ 2002-11-13 2:13 UTC (permalink / raw)
  To: Neil Brown; +Cc: H. Peter Anvin, linux-raid

On Wed, Nov 13, 2002 at 06:37:40AM +1100, Neil Brown wrote:
> On Tuesday November 12, jakob@unthought.net wrote:
> >
> > You might want to consider using huge chunk sizes when reading, but
> > making sure that writes can be made on "sub-chunks" -- so that one could
> > run a RAID-6 with a 128k chunk size, yet have writes performed on 4k
> > chunks.  This is important for performance on both read and write, but
> > it is an optimization the current RAID-5 code lacks.
>
> Either I misunderstand your point, or you misunderstand the code.
>
> A 4k write request will cause a 4k write to a data block and a 4k
> write to a parity block, no matter what the chunk size is.  (There may
> also be pre-reading, and possibly several 4k writes will share a
> parity block update.)

Writes on a 128k chunk array are significantly slower than writes on a
4k chunk array, according to someone else on this list -- I wanted to
look into this myself, but now is just a bad time for me (nothing new
on that front).

The benchmark goes:

| some tests on raid5 with 4k and 128k chunk size. The results are as follows:
| Access Spec     4K(MBps)    4K-deg(MBps)  128K(MBps)   128K-deg(MBps)
| 2K Seq Read     23.015089   33.293993     25.415035    32.669278
| 2K Seq Write    27.363041   30.555328     14.185889    16.087862
| 64K Seq Read    22.952559   44.414774     26.02711     44.036993
| 64K Seq Write   25.171833   32.67759      13.97861     15.618126

So it drops from 27MB/sec to 14MB/sec running 2k-block sequential writes
on a 128k chunk array versus a 4k chunk array (non-degraded).

In degraded mode, the writes drop from 30MB/sec to 16MB/sec as the
chunk size increases.

Something's fishy.

> I see no lacking optimisation, but if you do, I would be keen to hear
> a more detailed explanation.

Well, if a 4k write really only causes a 4k write to disk, even with a
128k chunk-size array, then something else is happening...  I didn't do
the benchmark, and I didn't get to investigate it further here, so I
can't really say much else productive :)

 / Jakob  "linux-raid message multiplexer"  Østergaard

--
................................................................
:   jakob@unthought.net   : And I see the elder races,         :
:.........................: putrid forms of man                :
:   Jakob Østergaard      : See him rise and claim the earth,  :
:        OZ9ABN           : his downfall is at hand.           :
:.........................:............{Konkhra}...............:

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: RAID-6
  2002-11-13  2:13 ` RAID-6 Jakob Oestergaard
@ 2002-11-13  3:33   ` RAID-6 Neil Brown
  2002-11-13 12:29     ` RAID-6 Jakob Oestergaard
  0 siblings, 1 reply; 20+ messages in thread
From: Neil Brown @ 2002-11-13 3:33 UTC (permalink / raw)
  To: Jakob Oestergaard; +Cc: H. Peter Anvin, linux-raid

On Wednesday November 13, jakob@unthought.net wrote:
>
> Writes on a 128k chunk array are significantly slower than writes on a
> 4k chunk array, according to someone else on this list -- I wanted to
> look into this myself, but now is just a bad time for me (nothing new
> on that front).
>
> The benchmark goes:
>
> | some tests on raid5 with 4k and 128k chunk size. The results are as follows:
> | Access Spec     4K(MBps)    4K-deg(MBps)  128K(MBps)   128K-deg(MBps)
> | 2K Seq Read     23.015089   33.293993     25.415035    32.669278
> | 2K Seq Write    27.363041   30.555328     14.185889    16.087862
> | 64K Seq Read    22.952559   44.414774     26.02711     44.036993
> | 64K Seq Write   25.171833   32.67759      13.97861     15.618126
>
> So it drops from 27MB/sec to 14MB/sec running 2k-block sequential writes
> on a 128k chunk array versus a 4k chunk array (non-degraded).

When doing sequential writes, a small chunk size means you are more
likely to fill up a whole stripe before data is flushed to disk, so it
is very possible that you won't need to pre-read parity at all.  With a
larger chunk size, it is more likely that you will have to write, and
possibly read, the parity block several times.

So if you are doing single-threaded sequential accesses, a smaller
chunk size is definitely better.

If you are doing lots of parallel accesses (a typical multi-user work
load), small chunk sizes tend to mean that every access goes to all
drives, so there is lots of contention.  In theory a larger chunk size
means that more accesses will be entirely satisfied from just one disk,
so there is more opportunity for concurrency between the different
users.

As always, the best way to choose a chunk size is to develop a realistic
work load and test it against several different chunk sizes.  There is
no rule like "bigger is better" or "smaller is better".

NeilBrown

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: RAID-6
  2002-11-13  3:33 ` RAID-6 Neil Brown
@ 2002-11-13 12:29   ` RAID-6 Jakob Oestergaard
  2002-11-13 17:33     ` RAID-6 H. Peter Anvin
                        ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Jakob Oestergaard @ 2002-11-13 12:29 UTC (permalink / raw)
  To: Neil Brown; +Cc: H. Peter Anvin, linux-raid

On Wed, Nov 13, 2002 at 02:33:46PM +1100, Neil Brown wrote:
...
> > The benchmark goes:
> >
> > | some tests on raid5 with 4k and 128k chunk size. The results are as follows:
> > | Access Spec     4K(MBps)    4K-deg(MBps)  128K(MBps)   128K-deg(MBps)
> > | 2K Seq Read     23.015089   33.293993     25.415035    32.669278
> > | 2K Seq Write    27.363041   30.555328     14.185889    16.087862
> > | 64K Seq Read    22.952559   44.414774     26.02711     44.036993
> > | 64K Seq Write   25.171833   32.67759      13.97861     15.618126
> >
> > So it drops from 27MB/sec to 14MB/sec running 2k-block sequential writes
> > on a 128k chunk array versus a 4k chunk array (non-degraded).
>
> When doing sequential writes, a small chunk size means you are more
> likely to fill up a whole stripe before data is flushed to disk, so it
> is very possible that you won't need to pre-read parity at all.  With a
> larger chunk size, it is more likely that you will have to write, and
> possibly read, the parity block several times.

Except if one worked on 4k sub-chunks -- right?  :)

> So if you are doing single-threaded sequential accesses, a smaller
> chunk size is definitely better.

Definitely not so for reads -- the seeking past the parity blocks ruins
sequential read performance when we do many such seeks (e.g. when we
have small chunks), as witnessed by the benchmark data above.

> If you are doing lots of parallel accesses (a typical multi-user work
> load), small chunk sizes tend to mean that every access goes to all
> drives, so there is lots of contention.  In theory a larger chunk size
> means that more accesses will be entirely satisfied from just one disk,
> so there is more opportunity for concurrency between the different
> users.
>
> As always, the best way to choose a chunk size is to develop a realistic
> work load and test it against several different chunk sizes.  There is
> no rule like "bigger is better" or "smaller is better".

For a single reader/writer, it was pretty obvious from the above that
"big is good" for reads (because of the fewer parity-block skip seeks)
and "small is good" for writes.

So making a big chunk-sized array but having it work on 4k sub-chunks
for writes was an idea I had which I felt would give the best scenario
in both cases.

Am I smoking crack, or?  ;)

--
................................................................
:   jakob@unthought.net   : And I see the elder races,         :
:.........................: putrid forms of man                :
:   Jakob Østergaard      : See him rise and claim the earth,  :
:        OZ9ABN           : his downfall is at hand.           :
:.........................:............{Konkhra}...............:

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: RAID-6
  2002-11-13 12:29 ` RAID-6 Jakob Oestergaard
@ 2002-11-13 17:33   ` RAID-6 H. Peter Anvin
  2002-11-13 18:07     ` RAID-6 Peter L. Ashford
  2002-11-13 22:50     ` RAID-6 Neil Brown
  2 siblings, 2 replies; 20+ messages in thread
From: H. Peter Anvin @ 2002-11-13 17:33 UTC (permalink / raw)
  To: Jakob Oestergaard; +Cc: Neil Brown, linux-raid

Jakob Oestergaard wrote:
>>
>> When doing sequential writes, a small chunk size means you are more
>> likely to fill up a whole stripe before data is flushed to disk, so it
>> is very possible that you won't need to pre-read parity at all.  With a
>> larger chunk size, it is more likely that you will have to write, and
>> possibly read, the parity block several times.
>
> Except if one worked on 4k sub-chunks -- right?  :)

No.  You probably want to look at the difference between RAID-3 and
RAID-4 (RAID-5 being basically RAID-4 twisted around in a rotating
pattern.)

> So making a big chunk-sized array but having it work on 4k sub-chunks
> for writes was an idea I had which I felt would give the best scenario
> in both cases.
>
> Am I smoking crack, or?  ;)

No, you're confusing RAID-3 and RAID-4/5.  In RAID-3, sequential blocks
are organized as:

    DISKS ------------------------------------>
     0       1       2       3      PARITY
     4       5       6       7      PARITY
     8       9      10      11      PARITY
    12      13      14      15      PARITY

... whereas in RAID-4 with a chunk size of four blocks it's:

    DISKS ------------------------------------>
     0       4       8      12      PARITY
     1       5       9      13      PARITY
     2       6      10      14      PARITY
     3       7      11      15      PARITY

If you only write blocks 0-3 you *have* to read in the 12 other data
blocks and write out all 4 parity blocks, whereas in RAID-3 you can get
away with only writing 5 blocks.  [Well, technically you could also do
a read-modify-write on the parity, since parity is linear.  This would
greatly complicate the code.]

Therefore, for small sequential writes chunking is *inherently* a lose,
and there isn't much you can do about it.

    -hpa

^ permalink raw reply [flat|nested] 20+ messages in thread
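To make the two tables above concrete, here is a small sketch (illustrative
only; parity rotation is ignored and the helper names are invented) of how a
logical block number maps to a data disk and row in each layout:

    #include <stdint.h>

    /* First table: consecutive blocks go to consecutive data disks
     * (equivalently, a chunk of one block). */
    static void map_interleaved(uint32_t lblock, uint32_t ndata,
                                uint32_t *disk, uint32_t *row)
    {
        *disk = lblock % ndata;
        *row  = lblock / ndata;
    }

    /* Second table: each data disk holds `chunk` consecutive blocks
     * before the next disk is used (RAID-4/5 style chunking). */
    static void map_chunked(uint32_t lblock, uint32_t ndata, uint32_t chunk,
                            uint32_t *disk, uint32_t *row)
    {
        uint32_t stripe = lblock / (ndata * chunk);   /* group of `chunk` rows */
        uint32_t within = lblock % (ndata * chunk);

        *disk = within / chunk;
        *row  = stripe * chunk + within % chunk;
    }

With ndata = 4 and chunk = 4, blocks 0-3 all land on the first disk across
rows 0-3, so a 4-block sequential write touches all four parity blocks of the
stripe -- which is the point being made above.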
* Re: RAID-6
  2002-11-13 17:33 ` RAID-6 H. Peter Anvin
@ 2002-11-13 18:07   ` RAID-6 Peter L. Ashford
  2002-11-13 22:50   ` RAID-6 Neil Brown
  1 sibling, 0 replies; 20+ messages in thread
From: Peter L. Ashford @ 2002-11-13 18:07 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linux-raid

Peter,

> No, you're confusing RAID-3 and RAID-4/5.  In RAID-3, sequential blocks
> are organized as:
>
>     DISKS ------------------------------------>
>      0       1       2       3      PARITY
>      4       5       6       7      PARITY
>      8       9      10      11      PARITY
>     12      13      14      15      PARITY
>
> ... whereas in RAID-4 with a chunk size of four blocks it's:
>
>     DISKS ------------------------------------>
>      0       4       8      12      PARITY
>      1       5       9      13      PARITY
>      2       6      10      14      PARITY
>      3       7      11      15      PARITY

The description you have for RAID-3 is wrong.  What you give as RAID-3
is actually RAID-4 with a 1-block segment size.

RAID-3 uses BITWISE (or BYTEWISE) striping with parity, as opposed to
the BLOCKWISE striping with parity in RAID-4/5.  In RAID-3, every I/O
transaction (regardless of size) accesses all drives in the array (a
read doesn't have to access the parity).  This gives high transfer
rates, even on single-block transactions.

    DISKS ------------------------------------>
    0.0     0.1     0.2     0.3     PARITY
    1.0     1.1     1.2     1.3     PARITY
    2.0     2.1     2.2     2.3     PARITY
    3.0     3.1     3.2     3.3     PARITY

It is NOT possible to read just 0.0 from the array.  If you were to
read the raw physical device, the results would be meaningless.

The structure of the array limits the number of spindles to one more
than a power of two (3, 5, 9, 17, etc.).  I have seen this implemented
with 5 and 9 drives (actually, the 9 was done with parallel heads).

Peter Ashford

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: RAID-6
  2002-11-13 17:33 ` RAID-6 H. Peter Anvin
  2002-11-13 18:07   ` RAID-6 Peter L. Ashford
@ 2002-11-13 22:50   ` RAID-6 Neil Brown
  1 sibling, 0 replies; 20+ messages in thread
From: Neil Brown @ 2002-11-13 22:50 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Jakob Oestergaard, linux-raid

On Wednesday November 13, hpa@zytor.com wrote:
>
>     DISKS ------------------------------------>
>      0       4       8      12      PARITY
>      1       5       9      13      PARITY
>      2       6      10      14      PARITY
>      3       7      11      15      PARITY
>
> If you only write blocks 0-3 you *have* to read in the 12 other data
> blocks and write out all 4 parity blocks, whereas in RAID-3 you can get
> away with only writing 5 blocks.  [Well, technically you could also do
> a read-modify-write on the parity, since parity is linear.  This would
> greatly complicate the code.]

We do read-modify-write if it involves fewer pre-reads than
reconstruct-write.  So in the above scenario, writing blocks 0,1,2,3
would cause a pre-read of those blocks and the 4 parity blocks, and
then all 8 blocks would be re-written.

NeilBrown

^ permalink raw reply [flat|nested] 20+ messages in thread
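A sketch of the pre-read trade-off described here (illustrative only; the
real heuristic lives in the md RAID-5 code and also considers which blocks
are already in the stripe cache):

    /* For one parity row with `data_disks` data blocks, of which `to_write`
     * will be overwritten: read-modify-write pre-reads the old data and old
     * parity; reconstruct-write pre-reads the untouched data blocks. */
    static int use_read_modify_write(int data_disks, int to_write)
    {
        int rmw_reads = to_write + 1;           /* old data blocks + old parity  */
        int rcw_reads = data_disks - to_write;  /* data blocks not being written */

        return rmw_reads < rcw_reads;           /* pick whichever reads less     */
    }

In the example above (4 data disks, one block written per row), RMW needs 2
pre-reads per row versus 3 for reconstruct-write, so the written block and
the parity block are pre-read and rewritten -- 8 blocks across the 4 rows,
exactly as described.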
* Re: RAID-6
  2002-11-13 12:29 ` RAID-6 Jakob Oestergaard
  2002-11-13 17:33   ` RAID-6 H. Peter Anvin
@ 2002-11-13 18:42   ` RAID-6 Peter L. Ashford
  2002-11-13 22:48   ` RAID-6 Neil Brown
  2 siblings, 0 replies; 20+ messages in thread
From: Peter L. Ashford @ 2002-11-13 18:42 UTC (permalink / raw)
  To: Jakob Oestergaard; +Cc: linux-raid

Jakob,

SNIP

> For a single reader/writer, it was pretty obvious from the above that
> "big is good" for reads (because of the fewer parity-block skip seeks)
> and "small is good" for writes.
>
> So making a big chunk-sized array but having it work on 4k sub-chunks
> for writes was an idea I had which I felt would give the best scenario
> in both cases.

Actually, the problem is worse than you describe.  Let's assume that we
have a RAID-5 array of 5 disks, with a segment size of 64KB.  In this
instance, the optimum I/O size will be 256KB.  Furthermore, that will
only be the optimum I/O when it is on a 256KB boundary.

I have, in the past, performed I/O benchmarks on raw arrays (both using
the MD driver and using 3Ware cards).  My results show that read speed
drops off when the segment size passes 128KB, but write speed stays
stable up to 2MB (the largest I/O size I tested).  This information,
combined with the benchmarks you posted earlier, shows that the write
slowdown when writing large I/O sizes is caused by the file-system
structure.

Current Linux file-systems don't support block sizes larger than 4KB.
This means that even if you perform the optimum-sized I/O, there is no
guarantee that the I/O will occur on the optimum boundary (it's
actually quite unlikely).  To make matters worse, there is no guarantee
that when you perform a large write, all the data will be placed in
contiguous blocks.

In order to maximize I/O throughput, it will be necessary to create a
Linux file-system that can effectively deal with large blocks (not
necessarily power-of-two in size).  The alternative would be to work
with the raw file-system, as many DBMSs do.  I have worked with a
file-system structure that deals well with large blocks, but it is not
in the public domain, and I doubt that CRAY is interested in porting
the NC1FS structure to Linux.

Peter Ashford

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: RAID-6
  2002-11-13 12:29 ` RAID-6 Jakob Oestergaard
  2002-11-13 17:33   ` RAID-6 H. Peter Anvin
  2002-11-13 18:42   ` RAID-6 Peter L. Ashford
@ 2002-11-13 22:48   ` RAID-6 Neil Brown
  2 siblings, 0 replies; 20+ messages in thread
From: Neil Brown @ 2002-11-13 22:48 UTC (permalink / raw)
  To: Jakob Oestergaard; +Cc: H. Peter Anvin, linux-raid

On Wednesday November 13, jakob@unthought.net wrote:
> On Wed, Nov 13, 2002 at 02:33:46PM +1100, Neil Brown wrote:
> ...
> > > The benchmark goes:
> > >
> > > | some tests on raid5 with 4k and 128k chunk size. The results are as follows:
> > > | Access Spec     4K(MBps)    4K-deg(MBps)  128K(MBps)   128K-deg(MBps)
> > > | 2K Seq Read     23.015089   33.293993     25.415035    32.669278
> > > | 2K Seq Write    27.363041   30.555328     14.185889    16.087862
> > > | 64K Seq Read    22.952559   44.414774     26.02711     44.036993
> > > | 64K Seq Write   25.171833   32.67759      13.97861     15.618126

These numbers look ... interesting.  I might try to reproduce them
myself.

> > > So it drops from 27MB/sec to 14MB/sec running 2k-block sequential
> > > writes on a 128k chunk array versus a 4k chunk array (non-degraded).
> >
> > When doing sequential writes, a small chunk size means you are more
> > likely to fill up a whole stripe before data is flushed to disk, so it
> > is very possible that you won't need to pre-read parity at all.  With a
> > larger chunk size, it is more likely that you will have to write, and
> > possibly read, the parity block several times.
>
> Except if one worked on 4k sub-chunks -- right?  :)

I still don't understand....  We *do* work with 4k sub-chunks.

> > So if you are doing single-threaded sequential accesses, a smaller
> > chunk size is definitely better.
>
> Definitely not so for reads -- the seeking past the parity blocks ruins
> sequential read performance when we do many such seeks (e.g. when we
> have small chunks), as witnessed by the benchmark data above.

Parity blocks aren't big enough to have to seek past.  I would imagine
that a modern drive would read a whole track into cache on the first
read request, and then find the required data, just past the parity
block, in the cache on the second request.  But maybe I'm wrong.  Or
there could be some factor in the device driver where lots of little
read requests, even though they are almost consecutive, are handled
more poorly than a few large read requests.

I wonder if it would be worth reading those parity blocks anyway if a
sequential read were detected....

> > If you are doing lots of parallel accesses (a typical multi-user work
> > load), small chunk sizes tend to mean that every access goes to all
> > drives, so there is lots of contention.  In theory a larger chunk size
> > means that more accesses will be entirely satisfied from just one disk,
> > so there is more opportunity for concurrency between the different
> > users.
> >
> > As always, the best way to choose a chunk size is to develop a
> > realistic work load and test it against several different chunk sizes.
> > There is no rule like "bigger is better" or "smaller is better".
>
> For a single reader/writer, it was pretty obvious from the above that
> "big is good" for reads (because of the fewer parity-block skip seeks)
> and "small is good" for writes.
>
> So making a big chunk-sized array but having it work on 4k sub-chunks
> for writes was an idea I had which I felt would give the best scenario
> in both cases.

The issue isn't so much the I/O size as the layout on disk.  You cannot
use one layout for read and a different layout for write.  That
obviously doesn't make sense.

NeilBrown

^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread, other threads:[~2005-12-29 18:29 UTC | newest]
Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <Pine.GSO.4.30.0211111138080.15590-100000@multivac.sdsc.edu>
2002-11-11 19:47 ` RAID-6 H. Peter Anvin
2005-11-13 9:05 Raid-6 Rebuild question Brad Campbell
2005-11-13 10:05 ` Neil Brown
2005-11-16 17:54 ` RAID-6 Bill Davidsen
2005-11-16 20:39 ` RAID-6 Dan Stromberg
2005-12-29 18:29 ` RAID-6 H. Peter Anvin
-- strict thread matches above, loose matches on Subject: below --
2002-11-11 18:52 RAID-6 H. Peter Anvin
2002-11-11 21:06 ` RAID-6 Derek Vadala
2002-11-11 22:44 ` RAID-6 Mr. James W. Laferriere
2002-11-11 23:05 ` RAID-6 H. Peter Anvin
2002-11-12 16:22 ` RAID-6 Jakob Oestergaard
2002-11-12 16:30 ` RAID-6 H. Peter Anvin
2002-11-12 19:01 ` RAID-6 H. Peter Anvin
2002-11-12 19:37 ` RAID-6 Neil Brown
2002-11-13 2:13 ` RAID-6 Jakob Oestergaard
2002-11-13 3:33 ` RAID-6 Neil Brown
2002-11-13 12:29 ` RAID-6 Jakob Oestergaard
2002-11-13 17:33 ` RAID-6 H. Peter Anvin
2002-11-13 18:07 ` RAID-6 Peter L. Ashford
2002-11-13 22:50 ` RAID-6 Neil Brown
2002-11-13 18:42 ` RAID-6 Peter L. Ashford
2002-11-13 22:48 ` RAID-6 Neil Brown