From mboxrd@z Thu Jan 1 00:00:00 1970
From: NeilBrown
Subject: Re: Triple-parity raid6
Date: Thu, 9 Jun 2011 22:04:38 +1000
Message-ID: <20110609220438.26336b27@notabene.brown>
References: <20110609114954.243e9e22@notabene.brown>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
Return-path:
In-Reply-To:
Sender: linux-raid-owner@vger.kernel.org
To: David Brown
Cc: linux-raid@vger.kernel.org
List-Id: linux-raid.ids

On Thu, 09 Jun 2011 13:32:59 +0200 David Brown wrote:

> On 09/06/2011 03:49, NeilBrown wrote:
> > On Thu, 09 Jun 2011 02:01:06 +0200 David Brown wrote:
> >
> >> Has anyone considered triple-parity raid6?  As far as I can see, it
> >> should not be significantly harder than normal raid6 - either to
> >> implement, or for the processor at run-time.  Once you have the GF(2⁸)
> >> field arithmetic in place for raid6, it's just a matter of making
> >> another parity block in the same way but using a different generator:
> >>
> >> P = D_0 + D_1 + D_2 + .. + D_(n-1)
> >> Q = D_0 + g.D_1 + g².D_2 + .. + g^(n-1).D_(n-1)
> >> R = D_0 + h.D_1 + h².D_2 + .. + h^(n-1).D_(n-1)
> >>
> >> The raid6 implementation in mdraid uses g = 0x02 to generate the second
> >> parity (based on "The mathematics of RAID-6" - I haven't checked the
> >> source code).  You can make a third parity using h = 0x04 and then get a
> >> redundancy of 3 disks.  (Note - I haven't yet confirmed that this is
> >> valid for more than 100 data disks - I need to make my checker program
> >> more efficient first.)
> >>
> >> Rebuilding a disk, or running in degraded mode, is just an obvious
> >> extension to the current raid6 algorithms.  If you are missing three
> >> data blocks, the maths looks hard to start with - but if you express the
> >> equations as a set of linear equations and use standard matrix inversion
> >> techniques, it should not be hard to implement.
> >> You only need to do this inversion once when you find that one or
> >> more disks have failed - then you pre-compute the multiplication
> >> tables in the same way as is done for raid6 today.
> >>
> >> In normal use, calculating the R parity is no more demanding than
> >> calculating the Q parity.  And most rebuilds or degraded situations will
> >> only involve a single disk, and the data can thus be re-constructed
> >> using the P parity just like raid5 or two-parity raid6.
> >>
> >> I'm sure there are situations where triple-parity raid6 would be
> >> appealing - it has already been implemented in ZFS, and it is only a
> >> matter of time before two-parity raid6 has a real probability of hitting
> >> an unrecoverable read error during a rebuild.
> >>
> >> And of course, there is no particular reason to stop at three parity
> >> blocks - the maths can easily be generalised.  1, 2, 4 and 8 can be used
> >> as generators for quad-parity (checked up to 60 disks), and adding 16
> >> gives you quintuple parity (checked up to 30 disks) - but that's maybe
> >> getting a bit paranoid.
> >>
> >
> > -ENOPATCH  :-)
> >
> > I have a series of patches nearly ready which removes a lot of the remaining
> > duplication in raid5.c between raid5 and raid6 paths.  So there will be
> > relatively few places where RAID5 and RAID6 do different things - only the
> > places where they *must* do different things.
> > After that, adding a new level or layout which has 'max_degraded == 3' would
> > be quite easy.
> > The most difficult part would be the enhancements to libraid6 to generate the
> > new 'syndrome', and to handle the different recovery possibilities.
> >
> > So if you're not otherwise busy this weekend, a patch would be nice :-)
> >
>
> I'm not going to promise any patches, but maybe I can help with the
> maths.
> You say the difficult part is the syndrome calculations and
> recovery - I've got these bits figured out on paper and in some
> quick-and-dirty Python test code.  On the other hand, I don't really
> want to get into the md kernel code, or the mdadm code - I haven't done
> Linux kernel development before (I mostly program 8-bit microcontrollers
> - when I code on Linux, I use Python), and I fear it would take me a
> long time to get up to speed.
>
> However, if the parity generation and recovery are neatly separated into
> a libraid6 library, the whole thing becomes much more tractable from my
> viewpoint.  Since I am new to this, can you tell me where I should get
> the current libraid6 code?  I'm sure Google will find some sources for
> me, but I'd like to make sure I start with whatever version /you/ have.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

You can see the current kernel code at:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=tree;f=lib/raid6;h=970c541a452d3b9983223d74b10866902f1a47c7;hb=HEAD

int.uc is the generic C code which 'unroll.awk' processes to make various
versions that unroll the loops by different amounts, to suit CPUs with
different numbers of registers.
Then there are sse1, sse2 and altivec, which provide the same functionality
in assembler optimised for various processors.

And 'recov' has the smarts for doing the reverse calculation when 2 data
blocks, or 1 data block and P, are missing.

Even if you don't feel up to implementing everything, a start might be
useful.  You never know when someone might jump up and offer to help.
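
For anyone following along, here is a minimal Python sketch of the maths
under discussion.  It is not code from lib/raid6.  It assumes the usual
RAID-6 field polynomial 0x11d and the generators 1, 2 and 4 proposed in the
quoted mail; every function name (gf_mul, gf_pow, gf_inv, syndromes,
recover) is made up for illustration.  It computes P, Q and R for one byte
position and rebuilds up to three lost data bytes by Gauss-Jordan
elimination over GF(2^8).

```python
# Sketch only: byte-wise P/Q/R over GF(2^8), assuming polynomial 0x11d.
# All names are hypothetical; nothing here is taken from lib/raid6.

def gf_mul(a, b):
    """Multiply two field elements (shift-and-add, reduced by 0x11d)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
        b >>= 1
    return r

def gf_pow(a, n):
    """Raise a to the n-th power by repeated multiplication."""
    r = 1
    for _ in range(n):
        r = gf_mul(r, a)
    return r

def gf_inv(a):
    """Inverse of a non-zero element: a^254, because a^255 == 1."""
    return gf_pow(a, 254)

GENERATORS = (1, 2, 4)  # P, Q (g = 0x02), R (h = 0x04)

def syndromes(data):
    """Return [P, Q, R] for one byte position across the data disks."""
    out = []
    for gen in GENERATORS:
        s = 0
        for i, d in enumerate(data):
            s ^= gf_mul(gf_pow(gen, i), d)
        out.append(s)
    return out

def recover(data, lost, synd):
    """Rebuild the bytes at the (up to three) lost data indices."""
    # Remove the surviving data terms from each syndrome (add == XOR).
    known = [0 if i in lost else d for i, d in enumerate(data)]
    rhs = [s ^ k for s, k in zip(synd, syndromes(known))]
    n = len(lost)
    # Augmented matrix built from the first n syndrome equations.
    m = [[gf_pow(GENERATORS[r], i) for i in lost] + [rhs[r]]
         for r in range(n)]
    # Gauss-Jordan elimination over GF(2^8).
    for c in range(n):
        p = next(r for r in range(c, n) if m[r][c])
        m[c], m[p] = m[p], m[c]
        inv = gf_inv(m[c][c])
        m[c] = [gf_mul(inv, x) for x in m[c]]
        for r in range(n):
            if r != c and m[r][c]:
                f = m[r][c]
                m[r] = [x ^ gf_mul(f, y) for x, y in zip(m[r], m[c])]
    return [m[r][n] for r in range(n)]

# Round trip: lose three data bytes, then rebuild them from P, Q and R.
data = [0x11, 0x22, 0x33, 0x44, 0x55, 0x66]
synd = syndromes(data)
damaged = [None if i in (1, 3, 4) else d for i, d in enumerate(data)]
assert recover(damaged, [1, 3, 4], synd) == [0x22, 0x44, 0x55]
```

This solves the system afresh for every byte, which is only sensible for
checking the maths.  As David says, a real implementation would do the
inversion once per set of failed disks and then work from pre-computed
multiplication tables.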

NeilBrown