* Re: RAID-6
       [not found] <Pine.GSO.4.30.0211111138080.15590-100000@multivac.sdsc.edu>
@ 2002-11-11 19:47 ` H. Peter Anvin
  0 siblings, 0 replies; 20+ messages in thread
From: H. Peter Anvin @ 2002-11-11 19:47 UTC (permalink / raw)
  To: Peter L. Ashford; +Cc: linux-raid

Peter L. Ashford wrote:
>
>> I'm playing around with RAID-6 algorithms lately. With RAID-6 I mean
>> a setup which needs N+2 disks for N disks worth of storage and can
>> handle any two disks failing -- this seems to be the contemporary
>> definition of RAID-6 (the originally proposed "two-dimensional parity"
>> which required N+2*sqrt(N) drives never took off for obvious reasons.)
>
> This appears to be the same as RAID-2.  Is there a web page that gives a
> more complete description?

http://www.acnc.com/04_01_06.html is a pretty good high-level
description, although it incorrectly states this is two-dimensional
parity, which it is *NOT* -- it's a Reed-Solomon syndrome.  The
distinction is critical in keeping the overhead down to 2 disks instead
of 2*sqrt(N) disks.

RAID-2 uses a Hamming code, according to the same web page, which has
the property that it will correct the data *even if you can't tell
which disks have failed*, whereas RAID-3 and higher all rely on
"erasure information", i.e. independent means to know which disks have
failed.  In practice this information is furnished by some kind of CRC
or other integrity check provided by the disk controller, or by the
disappearance of said controller.

    -hpa

^ permalink raw reply [flat|nested] 20+ messages in thread
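For reference, the P+Q ("Reed-Solomon syndrome") scheme described above can be
sketched as follows; the notation is added here for illustration and is not
part of the original mail.  Over GF(2^8) with generator g = 2, the two
syndromes computed from data blocks D_0 ... D_{n-1} are

    \[ P = \bigoplus_{i=0}^{n-1} D_i, \qquad
       Q = \bigoplus_{i=0}^{n-1} g^{i} D_i \quad \text{in } \mathrm{GF}(2^8),\ g = 2. \]

Because P and Q are independent, any two known-failed ("erased") devices can
be reconstructed: two lost data disks give two equations in two unknowns, a
lost data disk plus P is recovered through Q, and a lost P or Q is simply
recomputed from the surviving data.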
* Raid-6 Rebuild question
@ 2005-11-13 9:05 Brad Campbell
2005-11-13 10:05 ` Neil Brown
0 siblings, 1 reply; 20+ messages in thread
From: Brad Campbell @ 2005-11-13 9:05 UTC (permalink / raw)
To: RAID Linux
G'day all,
Here is an interesting question (well, I think so in any case). I just replaced a failed disk in my
15 drive Raid-6.
Simply mdadm --add /dev/md0 /dev/sdl
Why, when there is no other activity on the array at all, is it writing to every disk during the
recovery? I would have assumed it would just read from the others and write to sdl.
This is iostat -k 5 output from that machine while rebuilding:
avg-cpu: %user %nice %sys %iowait %idle
0.00 0.00 100.00 0.00 0.00
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 121.08 14187.95 925.30 23552 1536
sdb 127.71 14187.95 1002.41 23552 1664
sdc 125.30 14187.95 1002.41 23552 1664
sdd 122.29 14187.95 1002.41 23552 1664
sde 125.30 14187.95 1002.41 23552 1664
sdf 127.71 14187.95 1002.41 23552 1664
sdg 125.90 14187.95 925.30 23552 1536
sdh 125.30 14187.95 925.30 23552 1536
sdi 134.34 14187.95 925.30 23552 1536
sdj 137.95 14187.95 925.30 23552 1536
sdk 140.36 14187.95 1850.60 23552 3072
sdl 79.52 0.00 14265.06 0 23680
sdm 133.13 14187.95 925.30 23552 1536
sdn 134.34 14187.95 925.30 23552 1536
sdo 133.73 14187.95 925.30 23552 1536
md0 0.00 0.00 0.00 0 0
storage1:/home/brad# cat /proc/mdstat
Personalities : [raid6]
md0 : active raid6 sdl[15] sdg[6] sda[0] sdo[14] sdn[13] sdm[12] sdk[10] sdj[9] sdi[8] sdh[7] sdf[5]
sde[4] sdd[3] sdc[2] sdb[1]
3186525056 blocks level 6, 128k chunk, algorithm 2 [15/14] [UUUUUUUUUUU_UUU]
[>....................] recovery = 1.8% (4518144/245117312) finish=838.3min speed=4782K/sec
unused devices: <none>
Regards,
Brad
--
"Human beings, who are almost unique in having the ability
to learn from the experience of others, are also remarkable
for their apparent disinclination to do so." -- Douglas Adams
^ permalink raw reply [flat|nested] 20+ messages in thread

* Re: Raid-6 Rebuild question
  2005-11-13  9:05 Raid-6 Rebuild question Brad Campbell
@ 2005-11-13 10:05 ` Neil Brown
  2005-11-16 17:54   ` RAID-6 Bill Davidsen
  0 siblings, 1 reply; 20+ messages in thread
From: Neil Brown @ 2005-11-13 10:05 UTC (permalink / raw)
  To: Brad Campbell; +Cc: RAID Linux

On Sunday November 13, brad@wasp.net.au wrote:
> G'day all,
>
> Here is an interesting question (well, I think so in any case). I just
> replaced a failed disk in my 15 drive Raid-6.
>
> Simply mdadm --add /dev/md0 /dev/sdl
>
> Why, when there is no other activity on the array at all, is it writing
> to every disk during the recovery? I would have assumed it would just
> read from the others and write to sdl.

The raid6 recovery code always writes out the P and Q blocks for every
stripe.  This is unnecessary, and there is in fact a comment in the code
saying:

  /**** FIX: Should we really do both of these unconditionally? ****/

I recently reviewed and cleaned up this code, though I haven't tested
the new version yet.  I'll make sure the new code doesn't do unnecessary
writes (it may already not).

So there is a good chance that 2.6.16 will do a better job here.

Thanks for the report,
NeilBrown

^ permalink raw reply [flat|nested] 20+ messages in thread
* RAID-6
  2005-11-13 10:05 ` Neil Brown
@ 2005-11-16 17:54   ` Bill Davidsen
  2005-11-16 20:39     ` RAID-6 Dan Stromberg
  0 siblings, 1 reply; 20+ messages in thread
From: Bill Davidsen @ 2005-11-16 17:54 UTC (permalink / raw)
  To: Neil Brown; +Cc: RAID Linux

Based on some Google searching on RAID-6, I find that it seems to be
used to describe two different things.  One is very similar to RAID-5,
but with two redundancy blocks per stripe, one XOR and one CRC (or at
any rate two methods are employed).  Other sources define RAID-6 as
RAID-5 with a distributed hot spare, AKA RAID-5E, which spreads head
motion across all drives for performance.

Any clarification on this?

--
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: RAID-6
  2005-11-16 17:54 ` RAID-6 Bill Davidsen
@ 2005-11-16 20:39   ` Dan Stromberg
  2005-12-29 18:29     ` RAID-6 H. Peter Anvin
  0 siblings, 1 reply; 20+ messages in thread
From: Dan Stromberg @ 2005-11-16 20:39 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: Neil Brown, RAID Linux, strombrg

My understanding is that RAID 5 -always- stripes parity.  If it didn't,
I believe it would be RAID 4.

You may find http://linux.cudeso.be/raid.php of interest.

I don't think RAID level 6 was in the original RAID paper, so vendors
may have decided on their own that it should mean what they're
selling. :)

On Wed, 2005-11-16 at 12:54 -0500, Bill Davidsen wrote:
> Based on some Google searching on RAID-6, I find that it seems to be
> used to describe two different things.  One is very similar to RAID-5,
> but with two redundancy blocks per stripe, one XOR and one CRC (or at
> any rate two methods are employed).  Other sources define RAID-6 as
> RAID-5 with a distributed hot spare, AKA RAID-5E, which spreads head
> motion across all drives for performance.
>
> Any clarification on this?

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: RAID-6
  2005-11-16 20:39 ` RAID-6 Dan Stromberg
@ 2005-12-29 18:29   ` H. Peter Anvin
  0 siblings, 0 replies; 20+ messages in thread
From: H. Peter Anvin @ 2005-12-29 18:29 UTC (permalink / raw)
  To: linux-raid

Followup to:  <1132173592.23464.459.camel@seki.nac.uci.edu>
By author:    Dan Stromberg <strombrg@dcs.nac.uci.edu>
In newsgroup: linux.dev.raid
>
> My understanding is that RAID 5 -always- stripes parity.  If it didn't,
> I believe it would be RAID 4.
>
> You may find http://linux.cudeso.be/raid.php of interest.
>
> I don't think RAID level 6 was in the original RAID paper, so vendors
> may have decided on their own that it should mean what they're
> selling. :)

RAID-6 wasn't in the original RAID paper, but the term RAID-6 with the
P+Q parity definition is by far the dominant use of the term, and I
believe it is/was recognized by the RAID Advisory Board, which is as
close as you can get to an official statement.  The RAB seems to have
become defunct, with a standard squatter page on its previous web
address.

    -hpa

^ permalink raw reply [flat|nested] 20+ messages in thread
* RAID-6
@ 2002-11-11 18:52 H. Peter Anvin
2002-11-11 21:06 ` RAID-6 Derek Vadala
` (2 more replies)
0 siblings, 3 replies; 20+ messages in thread
From: H. Peter Anvin @ 2002-11-11 18:52 UTC (permalink / raw)
To: linux-raid
Hi all,
I'm playing around with RAID-6 algorithms lately. With RAID-6 I mean
a setup which needs N+2 disks for N disks worth of storage and can
handle any two disks failing -- this seems to be the contemporary
definition of RAID-6 (the originally proposed "two-dimensional parity"
which required N+2*sqrt(N) drives never took off for obvious reasons.)
Based on my current research, I think the following should be true:
a) write performance will be worse than RAID-5, but I believe it can
be kept to within a factor of 1.5-2.0 on machines with suitable
SIMD instruction sets (e.g. MMX or SSE-2);
b) read performance in normal and single failure degraded mode will be
comparable to RAID-5;
c) read performance in dual failure degraded mode will be quite bad.
I'm curious how much interest there would be in this, since I
certainly have enough projects without it, and I'm probably going to
need some of Neil's time to integrate it into the md driver and the
tools.
-hpa
--
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt <amsp@zytor.com>
^ permalink raw reply [flat|nested] 20+ messages in thread

* Re: RAID-6
  2002-11-11 18:52 RAID-6 H. Peter Anvin
@ 2002-11-11 21:06 ` RAID-6 Derek Vadala
  2002-11-11 22:44 ` RAID-6 Mr. James W. Laferriere
  2002-11-12 16:22 ` RAID-6 Jakob Oestergaard
  2 siblings, 0 replies; 20+ messages in thread
From: Derek Vadala @ 2002-11-11 21:06 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linux-raid

On 11 Nov 2002, H. Peter Anvin wrote:

> I'm curious how much interest there would be in this, since I
> certainly have enough projects without it, and I'm probably going to
> need some of Neil's time to integrate it into the md driver and the
> tools.

There was quite a long thread about this last June (it starts here:
http://marc.theaimsgroup.com/?l=linux-raid&m=102305890732421&w=2).

I've seen the lack of RAID-6 support cited as one of the shortcomings
of Linux SW RAID quite a few times, by quite a few sources.  That seems
like one reason to implement it.

--
Derek Vadala, derek@cynicism.com, http://www.cynicism.com/~derek

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: RAID-6
  2002-11-11 18:52 RAID-6 H. Peter Anvin
  2002-11-11 21:06 ` RAID-6 Derek Vadala
@ 2002-11-11 22:44 ` Mr. James W. Laferriere
  2002-11-11 23:05   ` RAID-6 H. Peter Anvin
  2002-11-12 16:22 ` RAID-6 Jakob Oestergaard
  2 siblings, 1 reply; 20+ messages in thread
From: Mr. James W. Laferriere @ 2002-11-11 22:44 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linux-raid

Hello Peter,

On 11 Nov 2002, H. Peter Anvin wrote:

> Hi all,
> I'm playing around with RAID-6 algorithms lately. With RAID-6 I mean
> a setup which needs N+2 disks for N disks worth of storage and can
> handle any two disks failing -- this seems to be the contemporary
> definition of RAID-6 (the originally proposed "two-dimensional parity"
> which required N+2*sqrt(N) drives never took off for obvious reasons.)

Was there a discussion of the 'two-dimensional parity' scheme on the
list?  I don't remember any (of course).  But other than 98+2+10, what
was the main difficulty?  I don't (personally) see any difficulty
(other than manageability/power/space) with the number of disks
required.  Tia, JimL

--
+------------------------------------------------------------------+
| James W. Laferriere    | System Techniques    | Give me VMS      |
| Network Engineer       | P.O. Box 854         |  Give me Linux   |
| babydr@baby-dragons.com | Coudersport PA 16915 |  only on AXP    |
+------------------------------------------------------------------+

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: RAID-6
  2002-11-11 22:44 ` RAID-6 Mr. James W. Laferriere
@ 2002-11-11 23:05   ` H. Peter Anvin
  0 siblings, 0 replies; 20+ messages in thread
From: H. Peter Anvin @ 2002-11-11 23:05 UTC (permalink / raw)
  To: Mr. James W. Laferriere; +Cc: linux-raid

Mr. James W. Laferriere wrote:
> Hello Peter,
>
> On 11 Nov 2002, H. Peter Anvin wrote:
>
>> Hi all,
>> I'm playing around with RAID-6 algorithms lately. With RAID-6 I mean
>> a setup which needs N+2 disks for N disks worth of storage and can
>> handle any two disks failing -- this seems to be the contemporary
>> definition of RAID-6 (the originally proposed "two-dimensional parity"
>> which required N+2*sqrt(N) drives never took off for obvious reasons.)
>
> Was there a discussion of the 'two-dimensional parity' scheme on the
> list?  I don't remember any (of course).  But other than 98+2+10, what
> was the main difficulty?  I don't (personally) see any difficulty
> (other than manageability/power/space) with the number of disks
> required.  Tia, JimL

No discussion of two-dimensional parity, but that was the originally
proposed RAID-6.  No one ever productized a solution like that, to the
best of my knowledge.

I don't know what you mean by "98+2+10", but the basic problem is that
with 2D parity, for N data drives you need 2*sqrt(N) redundancy drives,
which for any moderate-sized RAID is a lot (with 9 data drives you need
6 redundancy drives, so you have 67% overhead.)  You also have the same
kind of performance problems as RAID-4 does, because the "rotating
parity" trick of RAID-5 does not work in two dimensions.  And for all
of this, you're not *guaranteed* more than dual failure recovery
(although you might, probabilistically, luck out and have more than
that.)

P+Q redundancy, the current meaning of RAID-6, instead uses two
orthogonal redundancy functions, so you only need two redundancy drives
regardless of how much data you have, and you can apply the RAID-5
trick of rotating the parity around.  So from your 15 drives in the
example above, you get 13 drives worth of data instead of 9.

    -hpa

^ permalink raw reply [flat|nested] 20+ messages in thread
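To spell out the drive-count arithmetic above (assuming, for the 2D-parity
case, the usual square arrangement of k x k data disks -- an assumption added
here, not stated in the mail):

    \[ \text{2D parity: } N = k^2 \Rightarrow 2k = 2\sqrt{N} \text{ redundancy drives};
       \qquad \text{P+Q: always } 2. \]
    \[ N = 9:\quad \tfrac{6}{9} \approx 67\% \text{ overhead (15 drives total)}
       \quad\text{vs.}\quad \tfrac{2}{9} \approx 22\% \text{ (11 drives total)}. \]

Turned around, a fixed budget of 15 drives yields 9 data drives under 2D
parity but 13 under P+Q, which is the comparison made above.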
* Re: RAID-6
  2002-11-11 18:52 RAID-6 H. Peter Anvin
  2002-11-11 21:06 ` RAID-6 Derek Vadala
  2002-11-11 22:44 ` RAID-6 Mr. James W. Laferriere
@ 2002-11-12 16:22 ` RAID-6 Jakob Oestergaard
  2002-11-12 16:30   ` RAID-6 H. Peter Anvin
  2002-11-12 19:37   ` RAID-6 Neil Brown
  2 siblings, 2 replies; 20+ messages in thread
From: Jakob Oestergaard @ 2002-11-12 16:22 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linux-raid

On Mon, Nov 11, 2002 at 10:52:36AM -0800, H. Peter Anvin wrote:
> Hi all,
>
> I'm playing around with RAID-6 algorithms lately. With RAID-6 I mean
> a setup which needs N+2 disks for N disks worth of storage and can
> handle any two disks failing -- this seems to be the contemporary
> definition of RAID-6 (the originally proposed "two-dimensional parity"
> which required N+2*sqrt(N) drives never took off for obvious reasons.)
>
> Based on my current research, I think the following should be true:
>
> a) write performance will be worse than RAID-5, but I believe it can
>    be kept to within a factor of 1.5-2.0 on machines with suitable
>    SIMD instruction sets (e.g. MMX or SSE-2);

Please note that raw CPU power is usually *not* a limiting (or even
significantly contributing) factor on modern systems.

Limitations are disk reads/writes/seeks, bus bandwidth, etc.

You will probably cause more bus activity with RAID-6, and that might
degrade performance.  But I don't think you need to worry about
MMX/SSE/...  If you can do as well as the current RAID-5 code, then you
will be in the clear until people have 1GB/sec disk transfer rates on
500MHz PIII systems ;)

> b) read performance in normal and single failure degraded mode will be
>    comparable to RAID-5;

Which again is like a RAID-0 with some extra seeks -- e.g. not too bad
with huge chunk sizes.

You might want to consider using huge chunk sizes when reading, but
making sure that writes can be made on "sub-chunks" -- so that one could
run a RAID-6 with a 128k chunk size, yet have writes performed on 4k
chunks.  This is important for performance on both read and write, but
it is an optimization the current RAID-5 code lacks.

> c) read performance in dual failure degraded mode will be quite bad.
>
> I'm curious how much interest there would be in this, since I
> certainly have enough projects without it, and I'm probably going to
> need some of Neil's time to integrate it into the md driver and the
> tools.

I've seen quite a few people ask for it.  You might find a friend in
"Roy Sigurd Karlsbach" -- he for one has been asking (loudly) for it ;)

Go Peter!  ;)

--
................................................................
:   jakob@unthought.net   : And I see the elder races,         :
:.........................: putrid forms of man                :
:   Jakob Østergaard      : See him rise and claim the earth,  :
:        OZ9ABN           : his downfall is at hand.           :
:.........................:............{Konkhra}...............:

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: RAID-6
  2002-11-12 16:22 ` RAID-6 Jakob Oestergaard
@ 2002-11-12 16:30   ` RAID-6 H. Peter Anvin
  2002-11-12 19:01     ` RAID-6 H. Peter Anvin
  1 sibling, 1 reply; 20+ messages in thread
From: H. Peter Anvin @ 2002-11-12 16:30 UTC (permalink / raw)
  To: Jakob Oestergaard; +Cc: linux-raid

Jakob Oestergaard wrote:
>>
>> a) write performance will be worse than RAID-5, but I believe it can
>>    be kept to within a factor of 1.5-2.0 on machines with suitable
>>    SIMD instruction sets (e.g. MMX or SSE-2);
>
> Please note that raw CPU power is usually *not* a limiting (or even
> significantly contributing) factor on modern systems.
>
> Limitations are disk reads/writes/seeks, bus bandwidth, etc.
>
> You will probably cause more bus activity with RAID-6, and that might
> degrade performance.  But I don't think you need to worry about
> MMX/SSE/...  If you can do as well as the current RAID-5 code, then you
> will be in the clear until people have 1GB/sec disk transfer rates on
> 500MHz PIII systems ;)

RAID-6 will, obviously, never do as well as RAID-5 -- you are doing
more work (both computational and data-pushing.)

The RAID-6 syndrome computation is actually extremely expensive if you
can't do it in parallel.  Fortunately there is a way to do it in
parallel using MMX or SSE-2, although it seems to exist by pure dumb
luck -- certainly not by design.  I've tried to figure out how to
generalize to using regular 32-bit or 64-bit integer registers, but it
doesn't seem to work there.  Again, my initial analysis seems to
indicate performance within about a factor of 2.

>> b) read performance in normal and single failure degraded mode will be
>>    comparable to RAID-5;
>
> Which again is like a RAID-0 with some extra seeks -- e.g. not too bad
> with huge chunk sizes.
>
> You might want to consider using huge chunk sizes when reading, but
> making sure that writes can be made on "sub-chunks" -- so that one could
> run a RAID-6 with a 128k chunk size, yet have writes performed on 4k
> chunks.  This is important for performance on both read and write, but
> it is an optimization the current RAID-5 code lacks.

That's an issue for the common framework; I'll leave that to Neil.
It's functionally equivalent between RAID-5 and -6.

>> c) read performance in dual failure degraded mode will be quite bad.
>>
>> I'm curious how much interest there would be in this, since I
>> certainly have enough projects without it, and I'm probably going to
>> need some of Neil's time to integrate it into the md driver and the
>> tools.
>
> I've seen quite a few people ask for it.  You might find a friend in
> "Roy Sigurd Karlsbach" -- he for one has been asking (loudly) for it ;)

:)

Enough people have responded that I think I have a project...

    -hpa

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: RAID-6
  2002-11-12 16:30 ` RAID-6 H. Peter Anvin
@ 2002-11-12 19:01   ` H. Peter Anvin
  0 siblings, 0 replies; 20+ messages in thread
From: H. Peter Anvin @ 2002-11-12 19:01 UTC (permalink / raw)
  To: linux-raid

Followup to:  <3DD12CA3.5090105@zytor.com>
By author:    "H. Peter Anvin" <hpa@zytor.com>
In newsgroup: linux.dev.raid
>
> I've tried to figure out how to generalize to using regular 32-bit
> or 64-bit integer registers, but it doesn't seem to work there.

Of course, once I had said that I just had to go and figure it out :)
This is good, because it's inherently the highest performance we can
get out of portable code.

    -hpa
--
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt   <amsp@zytor.com>

^ permalink raw reply [flat|nested] 20+ messages in thread
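As a rough illustration of what such a portable, integer-register formulation
can look like (this is a sketch added here, not hpa's code, and the helper
names are invented): each 64-bit word holds eight GF(2^8) byte lanes,
multiplication by the generator g = 2 is done with shifts and masks, and the
Q syndrome is accumulated with Horner's rule.

    #include <stdint.h>
    #include <stddef.h>

    /* Multiply each of the eight bytes packed in v by 2 in GF(2^8)
     * (polynomial 0x11d), using only ordinary integer operations. */
    static uint64_t gf256_mul2(uint64_t v)
    {
        uint64_t hi   = v & 0x8080808080808080ULL;    /* high bit of each byte */
        uint64_t mask = (hi << 1) - (hi >> 7);        /* 0xff where it was set */

        return ((v << 1) & 0xfefefefefefefefeULL) ^ (mask & 0x1d1d1d1d1d1d1d1dULL);
    }

    /* Compute the P (XOR) and Q (Reed-Solomon) syndromes over `disks`
     * data buffers of `bytes` bytes each.  Assumes bytes % 8 == 0 and
     * 8-byte-aligned buffers. */
    static void gen_pq(int disks, size_t bytes, uint8_t **dptr,
                       uint8_t *p, uint8_t *q)
    {
        for (size_t off = 0; off < bytes; off += 8) {
            uint64_t wp = 0, wq = 0;

            for (int d = disks - 1; d >= 0; d--) {    /* Horner's rule */
                uint64_t wd = *(uint64_t *)(dptr[d] + off);

                wp ^= wd;                             /* P = sum of D_d       */
                wq  = gf256_mul2(wq) ^ wd;            /* Q = sum of 2^d * D_d */
            }
            *(uint64_t *)(p + off) = wp;
            *(uint64_t *)(q + off) = wq;
        }
    }

The same per-byte trick maps directly onto MMX/SSE-2 registers, just 8 or 16
bytes at a time, which is the parallel formulation discussed earlier in the
thread.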
* Re: RAID-6
  2002-11-12 16:22 ` RAID-6 Jakob Oestergaard
  2002-11-12 16:30   ` RAID-6 H. Peter Anvin
@ 2002-11-12 19:37   ` RAID-6 Neil Brown
  2002-11-13  2:13     ` RAID-6 Jakob Oestergaard
  1 sibling, 1 reply; 20+ messages in thread
From: Neil Brown @ 2002-11-12 19:37 UTC (permalink / raw)
  To: Jakob Oestergaard; +Cc: H. Peter Anvin, linux-raid

On Tuesday November 12, jakob@unthought.net wrote:
>
> You might want to consider using huge chunk sizes when reading, but
> making sure that writes can be made on "sub-chunks" -- so that one could
> run a RAID-6 with a 128k chunk size, yet have writes performed on 4k
> chunks.  This is important for performance on both read and write, but
> it is an optimization the current RAID-5 code lacks.

Either I misunderstand your point, or you misunderstand the code.

A 4k write request will cause a 4k write to a data block and a 4k
write to a parity block, no matter what the chunk size is.  (There may
also be pre-reading, and possibly several 4k writes will share a
parity block update.)

I see no lacking optimisation, but if you do, I would be keen to hear
a more detailed explanation.

NeilBrown

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: RAID-6
  2002-11-12 19:37 ` RAID-6 Neil Brown
@ 2002-11-13  2:13   ` RAID-6 Jakob Oestergaard
  2002-11-13  3:33     ` RAID-6 Neil Brown
  0 siblings, 1 reply; 20+ messages in thread
From: Jakob Oestergaard @ 2002-11-13 2:13 UTC (permalink / raw)
  To: Neil Brown; +Cc: H. Peter Anvin, linux-raid

On Wed, Nov 13, 2002 at 06:37:40AM +1100, Neil Brown wrote:
> On Tuesday November 12, jakob@unthought.net wrote:
> >
> > You might want to consider using huge chunk sizes when reading, but
> > making sure that writes can be made on "sub-chunks" -- so that one could
> > run a RAID-6 with a 128k chunk size, yet have writes performed on 4k
> > chunks.  This is important for performance on both read and write, but
> > it is an optimization the current RAID-5 code lacks.
>
> Either I misunderstand your point, or you misunderstand the code.
>
> A 4k write request will cause a 4k write to a data block and a 4k
> write to a parity block, no matter what the chunk size is.  (There may
> also be pre-reading, and possibly several 4k writes will share a
> parity block update.)

Writes on a 128k chunk array are significantly slower than writes on a
4k chunk array, according to someone else on this list -- I wanted to
look into this myself, but now is just a bad time for me (nothing new
on that front).

The benchmark goes:

| some tests on raid5 with 4k and 128k chunk size. The results are as follows:
| Access Spec     4K(MBps)    4K-deg(MBps)  128K(MBps)   128K-deg(MBps)
| 2K Seq Read     23.015089   33.293993     25.415035    32.669278
| 2K Seq Write    27.363041   30.555328     14.185889    16.087862
| 64K Seq Read    22.952559   44.414774     26.02711     44.036993
| 64K Seq Write   25.171833   32.67759      13.97861     15.618126

So it drops from 27MB/sec to 14MB/sec running 2k-block sequential writes
on a 128k chunk array versus a 4k chunk array (non-degraded).

In degraded mode, the writes drop from 30MB/sec to 16MB/sec as the
chunk size increases.

Something's fishy.

> I see no lacking optimisation, but if you do, I would be keen to hear
> a more detailed explanation.

Well, if a 4k write really only causes a 4k write to disk, even with a
128k chunk-size array, then something else is happening...  I didn't do
the benchmark, and I didn't get to investigate it further here, so I
can't really say much else productive :)

 / Jakob  "linux-raid message multiplexer"  Østergaard

--
................................................................
:   jakob@unthought.net   : And I see the elder races,         :
:.........................: putrid forms of man                :
:   Jakob Østergaard      : See him rise and claim the earth,  :
:        OZ9ABN           : his downfall is at hand.           :
:.........................:............{Konkhra}...............:

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: RAID-6
  2002-11-13  2:13 ` RAID-6 Jakob Oestergaard
@ 2002-11-13  3:33   ` RAID-6 Neil Brown
  2002-11-13 12:29     ` RAID-6 Jakob Oestergaard
  0 siblings, 1 reply; 20+ messages in thread
From: Neil Brown @ 2002-11-13 3:33 UTC (permalink / raw)
  To: Jakob Oestergaard; +Cc: H. Peter Anvin, linux-raid

On Wednesday November 13, jakob@unthought.net wrote:
>
> Writes on a 128k chunk array are significantly slower than writes on a
> 4k chunk array, according to someone else on this list -- I wanted to
> look into this myself, but now is just a bad time for me (nothing new
> on that front).
>
> The benchmark goes:
>
> | some tests on raid5 with 4k and 128k chunk size. The results are as follows:
> | Access Spec     4K(MBps)    4K-deg(MBps)  128K(MBps)   128K-deg(MBps)
> | 2K Seq Read     23.015089   33.293993     25.415035    32.669278
> | 2K Seq Write    27.363041   30.555328     14.185889    16.087862
> | 64K Seq Read    22.952559   44.414774     26.02711     44.036993
> | 64K Seq Write   25.171833   32.67759      13.97861     15.618126
>
> So it drops from 27MB/sec to 14MB/sec running 2k-block sequential writes
> on a 128k chunk array versus a 4k chunk array (non-degraded).

When doing sequential writes, a small chunk size means you are more
likely to fill up a whole stripe before data is flushed to disk, so it
is very possible that you won't need to pre-read parity at all.  With a
larger chunk size, it is more likely that you will have to write, and
possibly read, the parity block several times.

So if you are doing single-threaded sequential accesses, a smaller
chunk size is definitely better.

If you are doing lots of parallel accesses (a typical multi-user work
load), small chunk sizes tend to mean that every access goes to all
drives, so there is lots of contention.  In theory a larger chunk size
means that more accesses will be entirely satisfied from just one disk,
so there is more opportunity for concurrency between the different
users.

As always, the best way to choose a chunk size is to develop a realistic
work load and test it against several different chunk sizes.  There is
no rule like "bigger is better" or "smaller is better".

NeilBrown

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: RAID-6
  2002-11-13  3:33 ` RAID-6 Neil Brown
@ 2002-11-13 12:29   ` RAID-6 Jakob Oestergaard
  2002-11-13 17:33     ` RAID-6 H. Peter Anvin
                        ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Jakob Oestergaard @ 2002-11-13 12:29 UTC (permalink / raw)
  To: Neil Brown; +Cc: H. Peter Anvin, linux-raid

On Wed, Nov 13, 2002 at 02:33:46PM +1100, Neil Brown wrote:
...
> > The benchmark goes:
> >
> > | some tests on raid5 with 4k and 128k chunk size. The results are as follows:
> > | Access Spec     4K(MBps)    4K-deg(MBps)  128K(MBps)   128K-deg(MBps)
> > | 2K Seq Read     23.015089   33.293993     25.415035    32.669278
> > | 2K Seq Write    27.363041   30.555328     14.185889    16.087862
> > | 64K Seq Read    22.952559   44.414774     26.02711     44.036993
> > | 64K Seq Write   25.171833   32.67759      13.97861     15.618126
> >
> > So it drops from 27MB/sec to 14MB/sec running 2k-block sequential writes
> > on a 128k chunk array versus a 4k chunk array (non-degraded).
>
> When doing sequential writes, a small chunk size means you are more
> likely to fill up a whole stripe before data is flushed to disk, so it
> is very possible that you won't need to pre-read parity at all.  With a
> larger chunk size, it is more likely that you will have to write, and
> possibly read, the parity block several times.

Except if one worked on 4k sub-chunks -- right?  :)

> So if you are doing single-threaded sequential accesses, a smaller
> chunk size is definitely better.

Definitely not so for reads -- the seeking past the parity blocks ruins
sequential read performance when we do many such seeks (e.g. when we
have small chunks), as witnessed by the benchmark data above.

> If you are doing lots of parallel accesses (a typical multi-user work
> load), small chunk sizes tend to mean that every access goes to all
> drives, so there is lots of contention.  In theory a larger chunk size
> means that more accesses will be entirely satisfied from just one disk,
> so there is more opportunity for concurrency between the different
> users.
>
> As always, the best way to choose a chunk size is to develop a realistic
> work load and test it against several different chunk sizes.  There is
> no rule like "bigger is better" or "smaller is better".

For a single reader/writer, it was pretty obvious from the above that
"big is good" for reads (because of the fewer parity-block skip seeks)
and "small is good" for writes.

So making a big chunk-sized array but having it work on 4k sub-chunks
for writes was an idea I had which I felt would give the best scenario
in both cases.

Am I smoking crack, or?  ;)

--
................................................................
:   jakob@unthought.net   : And I see the elder races,         :
:.........................: putrid forms of man                :
:   Jakob Østergaard      : See him rise and claim the earth,  :
:        OZ9ABN           : his downfall is at hand.           :
:.........................:............{Konkhra}...............:

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: RAID-6
  2002-11-13 12:29 ` RAID-6 Jakob Oestergaard
@ 2002-11-13 17:33   ` RAID-6 H. Peter Anvin
  2002-11-13 18:07     ` RAID-6 Peter L. Ashford
  2002-11-13 22:50     ` RAID-6 Neil Brown
  2 siblings, 2 replies; 20+ messages in thread
From: H. Peter Anvin @ 2002-11-13 17:33 UTC (permalink / raw)
  To: Jakob Oestergaard; +Cc: Neil Brown, linux-raid

Jakob Oestergaard wrote:
>>
>> When doing sequential writes, a small chunk size means you are more
>> likely to fill up a whole stripe before data is flushed to disk, so it
>> is very possible that you won't need to pre-read parity at all.  With a
>> larger chunk size, it is more likely that you will have to write, and
>> possibly read, the parity block several times.
>
> Except if one worked on 4k sub-chunks -- right?  :)

No.  You probably want to look at the difference between RAID-3 and
RAID-4 (RAID-5 being basically RAID-4 twisted around in a rotating
pattern.)

> So making a big chunk-sized array but having it work on 4k sub-chunks
> for writes was an idea I had which I felt would give the best scenario
> in both cases.
>
> Am I smoking crack, or?  ;)

No, you're confusing RAID-3 and RAID-4/5.  In RAID-3, sequential blocks
are organized as:

    DISKS ------------------------------------>
     0       1       2       3      PARITY
     4       5       6       7      PARITY
     8       9      10      11      PARITY
    12      13      14      15      PARITY

... whereas in RAID-4 with a chunk size of four blocks it's:

    DISKS ------------------------------------>
     0       4       8      12      PARITY
     1       5       9      13      PARITY
     2       6      10      14      PARITY
     3       7      11      15      PARITY

If you only write blocks 0-3 you *have* to read in the 12 other data
blocks and write out all 4 parity blocks, whereas in RAID-3 you can get
away with only writing 5 blocks.  [Well, technically you could also do
a read-modify-write on the parity, since parity is linear.  This would
greatly complicate the code.]

Therefore, for small sequential writes chunking is *inherently* a lose,
and there isn't much you can do about it.

    -hpa

^ permalink raw reply [flat|nested] 20+ messages in thread
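To make the two tables above concrete, here is a small sketch (illustrative
only; parity rotation is ignored and the helper names are invented) of how a
logical block number maps to a data disk and row in each layout:

    #include <stdint.h>

    /* First table: consecutive blocks go to consecutive data disks
     * (equivalently, a chunk of one block). */
    static void map_interleaved(uint32_t lblock, uint32_t ndata,
                                uint32_t *disk, uint32_t *row)
    {
        *disk = lblock % ndata;
        *row  = lblock / ndata;
    }

    /* Second table: each data disk holds `chunk` consecutive blocks
     * before the next disk is used (RAID-4/5 style chunking). */
    static void map_chunked(uint32_t lblock, uint32_t ndata, uint32_t chunk,
                            uint32_t *disk, uint32_t *row)
    {
        uint32_t stripe = lblock / (ndata * chunk);   /* group of `chunk` rows */
        uint32_t within = lblock % (ndata * chunk);

        *disk = within / chunk;
        *row  = stripe * chunk + within % chunk;
    }

With ndata = 4 and chunk = 4, blocks 0-3 all land on the first disk across
rows 0-3, so a 4-block sequential write touches all four parity blocks of the
stripe -- which is the point being made above.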
* Re: RAID-6
  2002-11-13 17:33 ` RAID-6 H. Peter Anvin
@ 2002-11-13 18:07   ` RAID-6 Peter L. Ashford
  2002-11-13 22:50   ` RAID-6 Neil Brown
  1 sibling, 0 replies; 20+ messages in thread
From: Peter L. Ashford @ 2002-11-13 18:07 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linux-raid

Peter,

> No, you're confusing RAID-3 and RAID-4/5.  In RAID-3, sequential blocks
> are organized as:
>
>     DISKS ------------------------------------>
>      0       1       2       3      PARITY
>      4       5       6       7      PARITY
>      8       9      10      11      PARITY
>     12      13      14      15      PARITY
>
> ... whereas in RAID-4 with a chunk size of four blocks it's:
>
>     DISKS ------------------------------------>
>      0       4       8      12      PARITY
>      1       5       9      13      PARITY
>      2       6      10      14      PARITY
>      3       7      11      15      PARITY

The description you have for RAID-3 is wrong.  What you give as RAID-3
is actually RAID-4 with a 1-block segment size.

RAID-3 uses BITWISE (or BYTEWISE) striping with parity, as opposed to
the BLOCKWISE striping with parity in RAID-4/5.  In RAID-3, every I/O
transaction (regardless of size) accesses all drives in the array (a
read doesn't have to access the parity).  This gives high transfer
rates, even on single-block transactions.

    DISKS ------------------------------------>
    0.0     0.1     0.2     0.3     PARITY
    1.0     1.1     1.2     1.3     PARITY
    2.0     2.1     2.2     2.3     PARITY
    3.0     3.1     3.2     3.3     PARITY

It is NOT possible to read just 0.0 from the array.  If you were to
read the raw physical device, the results would be meaningless.

The structure of the array limits the number of spindles to one more
than a power of two (3, 5, 9, 17, etc.).  I have seen this implemented
with 5 and 9 drives (actually, the 9 was done with parallel heads).

Peter Ashford

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: RAID-6
  2002-11-13 17:33 ` RAID-6 H. Peter Anvin
  2002-11-13 18:07   ` RAID-6 Peter L. Ashford
@ 2002-11-13 22:50   ` RAID-6 Neil Brown
  1 sibling, 0 replies; 20+ messages in thread
From: Neil Brown @ 2002-11-13 22:50 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Jakob Oestergaard, linux-raid

On Wednesday November 13, hpa@zytor.com wrote:
>
>     DISKS ------------------------------------>
>      0       4       8      12      PARITY
>      1       5       9      13      PARITY
>      2       6      10      14      PARITY
>      3       7      11      15      PARITY
>
> If you only write blocks 0-3 you *have* to read in the 12 other data
> blocks and write out all 4 parity blocks, whereas in RAID-3 you can get
> away with only writing 5 blocks.  [Well, technically you could also do
> a read-modify-write on the parity, since parity is linear.  This would
> greatly complicate the code.]

We do read-modify-write if it involves fewer pre-reads than
reconstruct-write.  So in the above scenario, writing blocks 0,1,2,3
would cause a pre-read of those blocks and the 4 parity blocks, and
then all 8 blocks would be re-written.

NeilBrown

^ permalink raw reply [flat|nested] 20+ messages in thread
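A sketch of the pre-read trade-off described here (illustrative only; the
real heuristic lives in the md RAID-5 code and also considers which blocks
are already in the stripe cache):

    /* For one parity row with `data_disks` data blocks, of which `to_write`
     * will be overwritten: read-modify-write pre-reads the old data and old
     * parity; reconstruct-write pre-reads the untouched data blocks. */
    static int use_read_modify_write(int data_disks, int to_write)
    {
        int rmw_reads = to_write + 1;           /* old data blocks + old parity  */
        int rcw_reads = data_disks - to_write;  /* data blocks not being written */

        return rmw_reads < rcw_reads;           /* pick whichever reads less     */
    }

In the example above (4 data disks, one block written per row), RMW needs 2
pre-reads per row versus 3 for reconstruct-write, so the written block and
the parity block are pre-read and rewritten -- 8 blocks across the 4 rows,
exactly as described.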
* Re: RAID-6
  2002-11-13 12:29 ` RAID-6 Jakob Oestergaard
  2002-11-13 17:33   ` RAID-6 H. Peter Anvin
@ 2002-11-13 18:42   ` RAID-6 Peter L. Ashford
  2002-11-13 22:48   ` RAID-6 Neil Brown
  2 siblings, 0 replies; 20+ messages in thread
From: Peter L. Ashford @ 2002-11-13 18:42 UTC (permalink / raw)
  To: Jakob Oestergaard; +Cc: linux-raid

Jakob,

SNIP

> For a single reader/writer, it was pretty obvious from the above that
> "big is good" for reads (because of the fewer parity-block skip seeks)
> and "small is good" for writes.
>
> So making a big chunk-sized array but having it work on 4k sub-chunks
> for writes was an idea I had which I felt would give the best scenario
> in both cases.

Actually, the problem is worse than you describe.  Let's assume that we
have a RAID-5 array of 5 disks, with a segment size of 64KB.  In this
instance, the optimum I/O size will be 256KB.  Furthermore, that will
only be the optimum I/O when it is on a 256KB boundary.

I have, in the past, performed I/O benchmarks on raw arrays (both using
the MD driver and using 3Ware cards).  My results show that read speed
drops off when the segment size passes 128KB, but write speed stays
stable up to 2MB (the largest I/O size I tested).  This information,
combined with the benchmarks you posted earlier, shows that the write
slowdown when writing large I/O sizes is caused by the file-system
structure.

Current Linux file-systems don't support block sizes larger than 4KB.
This means that even if you perform the optimum-sized I/O, there is no
guarantee that the I/O will occur on the optimum boundary (it's
actually quite unlikely).  To make matters worse, there is no guarantee
that when you perform a large write, all the data will be placed in
contiguous blocks.

In order to maximize I/O throughput, it will be necessary to create a
Linux file-system that can effectively deal with large blocks (not
necessarily power-of-two in size).  The alternative would be to work
with the raw file-system, as many DBMSs do.  I have worked with a
file-system structure that deals well with large blocks, but it is not
in the public domain, and I doubt that CRAY is interested in porting
the NC1FS structure to Linux.

Peter Ashford

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: RAID-6
  2002-11-13 12:29 ` RAID-6 Jakob Oestergaard
  2002-11-13 17:33   ` RAID-6 H. Peter Anvin
  2002-11-13 18:42   ` RAID-6 Peter L. Ashford
@ 2002-11-13 22:48   ` RAID-6 Neil Brown
  2 siblings, 0 replies; 20+ messages in thread
From: Neil Brown @ 2002-11-13 22:48 UTC (permalink / raw)
  To: Jakob Oestergaard; +Cc: H. Peter Anvin, linux-raid

On Wednesday November 13, jakob@unthought.net wrote:
> On Wed, Nov 13, 2002 at 02:33:46PM +1100, Neil Brown wrote:
> ...
> > > The benchmark goes:
> > >
> > > | some tests on raid5 with 4k and 128k chunk size. The results are as follows:
> > > | Access Spec     4K(MBps)    4K-deg(MBps)  128K(MBps)   128K-deg(MBps)
> > > | 2K Seq Read     23.015089   33.293993     25.415035    32.669278
> > > | 2K Seq Write    27.363041   30.555328     14.185889    16.087862
> > > | 64K Seq Read    22.952559   44.414774     26.02711     44.036993
> > > | 64K Seq Write   25.171833   32.67759      13.97861     15.618126

These numbers look ... interesting.  I might try to reproduce them
myself.

> > > So it drops from 27MB/sec to 14MB/sec running 2k-block sequential
> > > writes on a 128k chunk array versus a 4k chunk array (non-degraded).
> >
> > When doing sequential writes, a small chunk size means you are more
> > likely to fill up a whole stripe before data is flushed to disk, so it
> > is very possible that you won't need to pre-read parity at all.  With a
> > larger chunk size, it is more likely that you will have to write, and
> > possibly read, the parity block several times.
>
> Except if one worked on 4k sub-chunks -- right?  :)

I still don't understand....  We *do* work with 4k sub-chunks.

> > So if you are doing single-threaded sequential accesses, a smaller
> > chunk size is definitely better.
>
> Definitely not so for reads -- the seeking past the parity blocks ruins
> sequential read performance when we do many such seeks (e.g. when we
> have small chunks), as witnessed by the benchmark data above.

Parity blocks aren't big enough to have to seek past.  I would imagine
that a modern drive would read a whole track into cache on the first
read request, and then find the required data, just past the parity
block, in the cache on the second request.  But maybe I'm wrong.  Or
there could be some factor in the device driver where lots of little
read requests, even though they are almost consecutive, are handled
more poorly than a few large read requests.

I wonder if it would be worth reading those parity blocks anyway if a
sequential read were detected....

> > If you are doing lots of parallel accesses (a typical multi-user work
> > load), small chunk sizes tend to mean that every access goes to all
> > drives, so there is lots of contention.  In theory a larger chunk size
> > means that more accesses will be entirely satisfied from just one disk,
> > so there is more opportunity for concurrency between the different
> > users.
> >
> > As always, the best way to choose a chunk size is to develop a
> > realistic work load and test it against several different chunk sizes.
> > There is no rule like "bigger is better" or "smaller is better".
>
> For a single reader/writer, it was pretty obvious from the above that
> "big is good" for reads (because of the fewer parity-block skip seeks)
> and "small is good" for writes.
>
> So making a big chunk-sized array but having it work on 4k sub-chunks
> for writes was an idea I had which I felt would give the best scenario
> in both cases.

The issue isn't so much the I/O size as the layout on disk.  You cannot
use one layout for read and a different layout for write.  That
obviously doesn't make sense.

NeilBrown

^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread, other threads:[~2005-12-29 18:29 UTC | newest]
Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <Pine.GSO.4.30.0211111138080.15590-100000@multivac.sdsc.edu>
2002-11-11 19:47 ` RAID-6 H. Peter Anvin
2005-11-13 9:05 Raid-6 Rebuild question Brad Campbell
2005-11-13 10:05 ` Neil Brown
2005-11-16 17:54 ` RAID-6 Bill Davidsen
2005-11-16 20:39 ` RAID-6 Dan Stromberg
2005-12-29 18:29 ` RAID-6 H. Peter Anvin
-- strict thread matches above, loose matches on Subject: below --
2002-11-11 18:52 RAID-6 H. Peter Anvin
2002-11-11 21:06 ` RAID-6 Derek Vadala
2002-11-11 22:44 ` RAID-6 Mr. James W. Laferriere
2002-11-11 23:05 ` RAID-6 H. Peter Anvin
2002-11-12 16:22 ` RAID-6 Jakob Oestergaard
2002-11-12 16:30 ` RAID-6 H. Peter Anvin
2002-11-12 19:01 ` RAID-6 H. Peter Anvin
2002-11-12 19:37 ` RAID-6 Neil Brown
2002-11-13 2:13 ` RAID-6 Jakob Oestergaard
2002-11-13 3:33 ` RAID-6 Neil Brown
2002-11-13 12:29 ` RAID-6 Jakob Oestergaard
2002-11-13 17:33 ` RAID-6 H. Peter Anvin
2002-11-13 18:07 ` RAID-6 Peter L. Ashford
2002-11-13 22:50 ` RAID-6 Neil Brown
2002-11-13 18:42 ` RAID-6 Peter L. Ashford
2002-11-13 22:48 ` RAID-6 Neil Brown