From: David Brown <david.brown@hesbynett.no>
To: linux-raid@vger.kernel.org
Subject: Re: Triple-parity raid6
Date: Thu, 09 Jun 2011 21:19:35 +0200
Message-ID: <isr6c7$7pu$1@dough.gmane.org>
In-Reply-To: <20110609220438.26336b27@notabene.brown>

On 09/06/11 14:04, NeilBrown wrote:
> On Thu, 09 Jun 2011 13:32:59 +0200 David Brown<david@westcontrol.com>  wrote:
>
>> On 09/06/2011 03:49, NeilBrown wrote:
>>> On Thu, 09 Jun 2011 02:01:06 +0200 David Brown<david.brown@hesbynett.no>
>>> wrote:
>>>
>>>> Has anyone considered triple-parity raid6 ?  As far as I can see, it
>>>> should not be significantly harder than normal raid6 - either  to
>>>> implement, or for the processor at run-time.  Once you have the GF(2⁸)
>>>> field arithmetic in place for raid6, it's just a matter of making
>>>> another parity block in the same way but using a different generator:
>>>>
>>>> P = D_0 + D_1 + D_2 + .. + D_(n-1)
>>>> Q = D_0 + g.D_1 + g².D_2 + .. + g^(n-1).D_(n-1)
>>>> R = D_0 + h.D_1 + h².D_2 + .. + h^(n-1).D_(n-1)
>>>>
>>>> The raid6 implementation in mdraid uses g = 0x02 to generate the second
>>>> parity (based on "The mathematics of RAID-6" - I haven't checked the
>>>> source code).  You can make a third parity using h = 0x04 and then get a
>>>> redundancy of 3 disks.  (Note - I haven't yet confirmed that this is
>>>> valid for more than 100 data disks - I need to make my checker program
>>>> more efficient first.)
>>>>
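As a rough illustration of the above: an untested, byte-at-a-time Python
sketch (real code would of course work on whole blocks, and the function
names here are made up for illustration), using the 0x11d field polynomial
from the RAID-6 paper and the generators g = 0x02, h = 0x04:

def gf_mul(a, b):
    # multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x^2 + 1 (0x11d)
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
    return r

def syndromes(data):
    # data is a list of per-disk bytes D_0 .. D_(n-1);
    # returns the three parity bytes (P, Q, R)
    p = q = r = 0
    gp = hp = 1                      # g^i and h^i, starting at i = 0
    for d in data:
        p ^= d
        q ^= gf_mul(gp, d)
        r ^= gf_mul(hp, d)
        gp = gf_mul(gp, 0x02)        # g = 0x02
        hp = gf_mul(hp, 0x04)        # h = 0x04
    return p, q, r

# e.g. syndromes([0x11, 0x22, 0x33]) gives one byte of each parity strip
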
>>>> Rebuilding a disk, or running in degraded mode, is just an obvious
>>>> extension to the current raid6 algorithms.  If you are missing three
>>>> data blocks, the maths looks hard to start with - but if you express the
>>>> equations as a set of linear equations and use standard matrix inversion
>>>> techniques, it should not be hard to implement.  You only need to do
>>>> this inversion once when you find that one or more disks have failed -
>>>> then you pre-compute the multiplication tables in the same way as is
>>>> done for raid6 today.
>>>>
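The recovery I have in mind is ordinary Gaussian elimination over GF(2⁸),
roughly like this untested sketch (gf_mul repeated from the snippet above,
names again made up; it handles one to three missing *data* blocks - a lost
parity block is simply recomputed):

def gf_mul(a, b):
    # GF(2^8), polynomial 0x11d, as in the previous sketch
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
    return r

def gf_pow(a, n):
    r = 1
    for _ in range(n):
        r = gf_mul(r, a)
    return r

def gf_inv(a):
    # brute force is fine - only 255 non-zero elements to try
    return next(x for x in range(1, 256) if gf_mul(a, x) == 1)

def recover(missing, known, parities):
    # missing:  indices of the missing data disks (up to 3 of them)
    # known:    {disk index: byte} for the surviving data disks
    # parities: (P, Q, R) bytes read from the parity disks
    gens = (0x01, 0x02, 0x04)
    k = len(missing)
    # k x k system A.x = b: parity equations with the known terms
    # folded into the right-hand side
    A = [[gf_pow(gens[j], i) for i in missing] for j in range(k)]
    b = []
    for j in range(k):
        rhs = parities[j]
        for i, d in known.items():
            rhs ^= gf_mul(gf_pow(gens[j], i), d)
        b.append(rhs)
    # Gauss-Jordan elimination; addition/subtraction is just XOR
    for col in range(k):
        piv = next(r for r in range(col, k) if A[r][col])
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        inv = gf_inv(A[col][col])
        A[col] = [gf_mul(inv, x) for x in A[col]]
        b[col] = gf_mul(inv, b[col])
        for r in range(k):
            if r != col and A[r][col]:
                f = A[r][col]
                A[r] = [x ^ gf_mul(f, y) for x, y in zip(A[r], A[col])]
                b[r] ^= gf_mul(f, b[col])
    return dict(zip(missing, b))

As noted above, the elimination only needs doing once per failure pattern;
per byte you then just apply the resulting coefficients.
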
>>>> In normal use, calculating the R parity is no more demanding than
>>>> calculating the Q parity.  And most rebuilds or degraded situations will
>>>> only involve a single disk, and the data can thus be re-constructed
>>>> using the P parity just like raid5 or two-parity raid6.
>>>>
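The common single-failure case really is nothing more than the raid5 XOR,
e.g. (helper name made up):

def recover_one(p, surviving):
    # rebuild the one missing data byte: XOR of P with all surviving data
    d = p
    for x in surviving:
        d ^= x
    return d
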
>>>>
>>>> I'm sure there are situations where triple-parity raid6 would be
>>>> appealing - it has already been implemented in ZFS, and it is only a
>>>> matter of time before two-parity raid6 has a real probability of hitting
>>>> an unrecoverable read error during a rebuild.
>>>>
>>>>
>>>> And of course, there is no particular reason to stop at three parity
>>>> blocks - the maths can easily be generalised.  1, 2, 4 and 8 can be used
>>>> as generators for quad-parity (checked up to 60 disks), and adding 16
>>>> gives you quintuple parity (checked up to 30 disks) - but that's maybe
>>>> getting a bit paranoid.
>>>>
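The check itself is straightforward, if slow in pure Python: as far as I can
see, a parity set is sound for n data disks exactly when every square
submatrix you can pick from the parity-coefficient matrix (a subset of the
generator rows and an equally sized subset of data columns) is invertible
over GF(2⁸).  An untested sketch of such a checker (names made up):

from itertools import combinations

def gf_mul(a, b):
    # GF(2^8), polynomial 0x11d
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
    return r

def gf_pow(a, n):
    r = 1
    for _ in range(n):
        r = gf_mul(r, a)
    return r

def singular(m):
    # forward elimination; singular if some column has no pivot
    m = [row[:] for row in m]
    k = len(m)
    for col in range(k):
        piv = next((r for r in range(col, k) if m[r][col]), None)
        if piv is None:
            return True
        m[col], m[piv] = m[piv], m[col]
        inv = next(x for x in range(1, 256) if gf_mul(m[col][col], x) == 1)
        m[col] = [gf_mul(inv, x) for x in m[col]]
        for r in range(col + 1, k):
            if m[r][col]:
                f = m[r][col]
                m[r] = [x ^ gf_mul(f, y) for x, y in zip(m[r], m[col])]
    return False

def check(gens, ndisks):
    # every j x j submatrix (j parity rows, j data columns) must be
    # non-singular for every possible failure pattern to be recoverable
    for j in range(1, len(gens) + 1):
        for rows in combinations(gens, j):
            for cols in combinations(range(ndisks), j):
                if singular([[gf_pow(g, c) for c in cols] for g in rows]):
                    return False
    return True

# e.g. check((0x01, 0x02, 0x04), 30) for triple parity over 30 data disks
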
>>>>
>>>> ref.:
>>>>
>>>> <http://kernel.org/pub/linux/kernel/people/hpa/raid6.pdf>
>>>> <http://blogs.oracle.com/ahl/entry/acm_triple_parity_raid>
>>>> <http://queue.acm.org/detail.cfm?id=1670144>
>>>> <http://blogs.oracle.com/ahl/entry/triple_parity_raid_z>
>>>>
>>>
>>>    -ENOPATCH  :-)
>>>
>>> I have a series of patches nearly ready which removes a lot of the remaining
>>> duplication in raid5.c between raid5 and raid6 paths.  So there will be
>>> relatively few places where RAID5 and RAID6 do different things - only the
>>> places where they *must* do different things.
>>> After that, adding a new level or layout which has 'max_degraded == 3' would
>>> be quite easy.
>>> The most difficult part would be the enhancements to libraid6 to generate the
>>> new 'syndrome', and to handle the different recovery possibilities.
>>>
>>> So if you're not otherwise busy this weekend, a patch would be nice :-)
>>>
>>
>> I'm not going to promise any patches, but maybe I can help with the
>> maths.  You say the difficult part is the syndrome calculations and
>> recovery - I've got these bits figured out on paper and some
>> quick-and-dirty Python test code.  On the other hand, I don't really
>> want to get into the md kernel code, or the mdadm code - I haven't done
>> Linux kernel development before (I mostly program 8-bit microcontrollers
>> - when I code on Linux, I use Python), and I fear it would take me a
>> long time to get up to speed.
>>
>> However, if the parity generation and recovery is neatly separated into
>> a libraid6 library, the whole thing becomes much more tractable from my
>> viewpoint.  Since I am new to this, can you tell me where I should get
>> the current libraid6 code?  I'm sure google will find some sources for
>> me, but I'd like to make sure I start with whatever version /you/ have.
>>
>>
>>
>>
>
> You can see the current kernel code at:
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=tree;f=lib/raid6;h=970c541a452d3b9983223d74b10866902f1a47c7;hb=HEAD
>
>
> int.uc is the generic C code which 'unroll.awk' processes to make various
> versions that unroll the loops by different amounts, to suit CPUs with
> different numbers of registers.
> Then there is sse1, sse2, altivec which provide the same functionality in
> assembler which is optimised for various processors.
>
> And 'recov' has the smarts for doing the reverse calculation when 2 data
> blocks, or 1 data and P are missing.
>
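For reference, the algebra behind the two-missing-data-blocks case, as I
understand it from the RAID-6 paper - untested byte-at-a-time Python again,
not what recov actually looks like:

def gf_mul(a, b):
    # GF(2^8), polynomial 0x11d
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
    return r

def gf_pow(a, n):
    r = 1
    for _ in range(n):
        r = gf_mul(r, a)
    return r

def gf_inv(a):
    return next(v for v in range(1, 256) if gf_mul(a, v) == 1)

def recover_two(x, y, known, p, q):
    # data disks x and y are gone; 'known' maps the surviving data-disk
    # indices to their bytes.  Fold the survivors into P and Q:
    pd, qd = p, q
    for i, d in known.items():
        pd ^= d
        qd ^= gf_mul(gf_pow(0x02, i), d)
    # now pd = D_x ^ D_y and qd = g^x.D_x ^ g^y.D_y, so
    #   D_x = (qd ^ g^y.pd) / (g^x ^ g^y),   D_y = pd ^ D_x
    dx = gf_mul(qd ^ gf_mul(gf_pow(0x02, y), pd),
                gf_inv(gf_pow(0x02, x) ^ gf_pow(0x02, y)))
    return dx, pd ^ dx

Presumably the real code folds the constants into lookup tables once per
failure pattern, in the same way the raid6 code pre-computes its
multiplication tables today.
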
> Even if you don't feel up to implementing everything, a start might be
> useful.  You never know when someone might jump up and offer to help.
>
> NeilBrown

Monday is a holiday here in Norway, so I've got a long weekend.  I 
should get at least /some/ time to have a look at libraid6!

Best regards,

David


