* Triple parity and beyond
@ 2013-11-18 22:08 Andrea Mazzoleni
2013-11-18 22:12 ` H. Peter Anvin
` (2 more replies)
0 siblings, 3 replies; 104+ messages in thread
From: Andrea Mazzoleni @ 2013-11-18 22:08 UTC (permalink / raw)
To: linux-raid, linux-btrfs; +Cc: hpa, david.brown, creamyfish
Hi,
I want to report that I recently implemented support for an arbitrary
number of parities that could also be useful for Linux RAID and Btrfs,
both currently limited to double parity.
In short, to generate the parity I use a Cauchy matrix specifically
built to be compatible with the existing Linux parity computation and
extensible to an arbitrary number of parities, without any limitation
on the number of data disks.
The Cauchy matrix for six parities is:
01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01...
01 02 04 08 10 20 40 80 1d 3a 74 e8 cd 87 13 26 4c 98 2d 5a b4 75...
01 f5 d2 c4 9a 71 f1 7f fc 87 c1 c6 19 2f 40 55 3d ba 53 04 9c 61...
01 bb a6 d7 c7 07 ce 82 4a 2f a5 9b b6 60 f1 ad e7 f4 06 d2 df 2e...
01 97 7f 9c 7c 18 bd a2 58 1a da 74 70 a3 e5 47 29 07 f5 80 23 e9...
01 2b 3f cf 73 2c d6 ed cb 74 15 78 8a c1 17 c9 89 68 21 ab 76 3b...
You can easily recognize the first row as RAID5 based on a simple
XOR, and the second row as RAID6 based on multiplications by powers
of 2. The other rows are for additional parity levels and they
require multiplications by arbitrary values that can be implemented
using the PSHUFB instruction.
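As an illustration of the idea, here is a minimal scalar sketch (not
the optimized SSSE3 code in raid.c): multiplication by a fixed
coefficient in GF(2^8) can be done with two 16-entry lookup tables,
one indexed by the low nibble and one by the high nibble, which is
exactly the operation that PSHUFB performs on 16 bytes at a time.

/*
 * Minimal scalar sketch of coefficient multiplication in GF(2^8) with
 * polynomial 0x11d, the same field used by the Linux RAID-6 code.
 * The two 16-entry tables per coefficient model what PSHUFB does in SIMD.
 */
#include <stdint.h>
#include <stddef.h>

uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    while (b) {
        if (b & 1)
            r ^= a;
        /* multiply a by x modulo x^8 + x^4 + x^3 + x^2 + 1 */
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
        b >>= 1;
    }
    return r;
}

/* build the low/high nibble tables of a fixed coefficient c, so that
 * gf_mul(c, x) == lo[x & 0x0f] ^ hi[x >> 4] for every byte x */
static void gf_tables(uint8_t c, uint8_t lo[16], uint8_t hi[16])
{
    for (int n = 0; n < 16; ++n) {
        lo[n] = gf_mul(c, (uint8_t)n);
        hi[n] = gf_mul(c, (uint8_t)(n << 4));
    }
}

/* generate nparity parity buffers from ndata data buffers of 'size'
 * bytes, using a generator matrix A[p][d] like the one shown above */
void gen_parity(const uint8_t *A, int nparity, int ndata,
                uint8_t **data, uint8_t **parity, size_t size)
{
    for (int p = 0; p < nparity; ++p) {
        uint8_t lo[16], hi[16];
        for (size_t i = 0; i < size; ++i)
            parity[p][i] = 0;
        for (int d = 0; d < ndata; ++d) {
            gf_tables(A[p * ndata + d], lo, hi);
            for (size_t i = 0; i < size; ++i)
                parity[p][i] ^= lo[data[d][i] & 0x0f] ^ hi[data[d][i] >> 4];
        }
    }
}

With SSSE3 the two tables fit in two XMM registers, so the inner loop
becomes a couple of PSHUFB and XOR instructions per 16 bytes of data.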
The performance of triple parity with PSHUFB is comparable to that of
an alternate triple parity implementation with the third row of
coefficients set as powers of 2^-1. This alternate implementation is
likely the fastest possible for CPUs without PSHUFB or a similar
instruction, but it has the limitation of not extending beyond triple
parity.
The Cauchy matrix, instead, works for any number of parities and at
the same time is compatible with the existing first two parity levels.
As far as I know, this is a new result that has not appeared on this
list or anywhere else.
You can see more details, performance results and fast
implementations for up to six parity levels at:
https://sourceforge.net/p/snapraid/code/ci/master/tree/raid.c
This was developed as part of my hobby project SnapRAID,
downloadable with full source at:
http://snapraid.sourceforge.net/
Please let me know if you are interested in a potential Linux
integration. I can certainly help with whatever is needed.
For reference, past discussions about triple parity in the
linux-raid list can be found at:
http://thread.gmane.org/gmane.linux.raid/34195
http://thread.gmane.org/gmane.linux.raid/37904
Ciao,
Andrea
* Re: Triple parity and beyond
2013-11-18 22:08 Andrea Mazzoleni
@ 2013-11-18 22:12 ` H. Peter Anvin
2013-11-18 22:35 ` Andrea Mazzoleni
2013-11-19 18:12 ` Piergiorgio Sartor
2013-11-20 22:29 ` Piergiorgio Sartor
2 siblings, 1 reply; 104+ messages in thread
From: H. Peter Anvin @ 2013-11-18 22:12 UTC (permalink / raw)
To: Andrea Mazzoleni, linux-raid, linux-btrfs; +Cc: david.brown, creamyfish
Hello,
This looks very interesting indeed. Could you perhaps describe how the
Cauchy matrix is derived, and under what conditions it would become
singular?
-hpa
* Re: Triple parity and beyond
2013-11-18 22:12 ` H. Peter Anvin
@ 2013-11-18 22:35 ` Andrea Mazzoleni
2013-11-18 23:25 ` H. Peter Anvin
0 siblings, 1 reply; 104+ messages in thread
From: Andrea Mazzoleni @ 2013-11-18 22:35 UTC (permalink / raw)
To: H. Peter Anvin; +Cc: linux-raid, linux-btrfs, david.brown, creamyfish
Hi Peter,
The Cauchy matrix has the mathematical property that it and all of its
square submatrices are always non-singular. So we are sure that we can
always solve the equations to recover the data disks.
Besides the mathematical proof, I've also inverted all the
377,342,351,231 possible submatrices for up to 6 parities and 251 data
disks, and got an experimental confirmation of this.
The only limit comes from GF(2^8). The maximum number of data disks is
2^8 + 1 - number_of_parities. For example, with 6 parities you can have
no more than 251 data disks. Over this limit it's not possible to build
a Cauchy matrix.
Note that with a Vandermonde matrix, instead, you don't have the
guarantee that all the submatrices are non-singular. This is why, using
power coefficients, sooner or later you end up with unsolvable
equations.
You can find the code that generates the Cauchy matrix, with some
explanation in the comments, in the set_cauchy() function at:
http://sourceforge.net/p/snapraid/code/ci/master/tree/mktables.c
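As a rough illustration of the construction (the real set_cauchy()
differs in the exact choice of field elements, and also arranges the
second row to be the RAID-6 powers of 2), a plain Cauchy matrix can be
built and normalized like this, reusing the gf_mul() from the sketch in
my first mail:

/*
 * Rough sketch: build an n_parity x n_data Cauchy matrix over GF(2^8)
 * and normalize it so the first row is all ones (RAID-5 compatible).
 * This plain form requires n_parity + n_data <= 256.
 */
#include <stdint.h>

uint8_t gf_mul(uint8_t a, uint8_t b);   /* from the earlier sketch */

static uint8_t gf_inv(uint8_t a)
{
    /* brute force is fine for table generation */
    for (int b = 1; b < 256; ++b)
        if (gf_mul(a, (uint8_t)b) == 1)
            return (uint8_t)b;
    return 0;   /* 0 has no inverse */
}

void make_cauchy(uint8_t *A, int n_parity, int n_data)
{
    /* pick n_parity + n_data distinct elements: x_i = i for the parity
     * rows, y_j = n_parity + j for the data columns, so x_i ^ y_j is
     * never zero and every entry is well defined */
    for (int i = 0; i < n_parity; ++i)
        for (int j = 0; j < n_data; ++j)
            A[i * n_data + j] = gf_inv((uint8_t)(i ^ (n_parity + j)));

    /* scaling a column by a non-zero constant keeps every square
     * submatrix non-singular, so force the first row to all ones */
    for (int j = 0; j < n_data; ++j) {
        uint8_t s = gf_inv(A[j]);
        for (int i = 0; i < n_parity; ++i)
            A[i * n_data + j] = gf_mul(A[i * n_data + j], s);
    }
}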
Ciao,
Andrea
* Re: Triple parity and beyond
2013-11-18 22:35 ` Andrea Mazzoleni
@ 2013-11-18 23:25 ` H. Peter Anvin
2013-11-19 10:16 ` David Brown
2013-11-19 17:28 ` Andrea Mazzoleni
0 siblings, 2 replies; 104+ messages in thread
From: H. Peter Anvin @ 2013-11-18 23:25 UTC (permalink / raw)
To: Andrea Mazzoleni; +Cc: linux-raid, linux-btrfs, david.brown, creamyfish
On 11/18/2013 02:35 PM, Andrea Mazzoleni wrote:
> Hi Peter,
>
> The Cauchy matrix has the mathematical property that it and all of its
> square submatrices are always non-singular. So we are sure that we can
> always solve the equations to recover the data disks.
>
> Besides the mathematical proof, I've also inverted all the
> 377,342,351,231 possible submatrices for up to 6 parities and 251 data
> disks, and got an experimental confirmation of this.
>
Nice.
>
> The only limit comes from GF(2^8). The maximum number of data disks is
> 2^8 + 1 - number_of_parities. For example, with 6 parities you can have
> no more than 251 data disks. Over this limit it's not possible to build
> a Cauchy matrix.
>
251? Not 255?
> Note that with a Vandermonde matrix, instead, you don't have the
> guarantee that all the submatrices are non-singular. This is why, using
> power coefficients, sooner or later you end up with unsolvable
> equations.
>
> You can find the code that generates the Cauchy matrix, with some
> explanation in the comments, in the set_cauchy() function at:
>
> http://sourceforge.net/p/snapraid/code/ci/master/tree/mktables.c
OK, need to read up on the theoretical aspects of this, but it sounds
promising.
-hpa
* Re: Triple parity and beyond
2013-11-18 23:25 ` H. Peter Anvin
@ 2013-11-19 10:16 ` David Brown
2013-11-19 17:36 ` Andrea Mazzoleni
2013-11-19 17:28 ` Andrea Mazzoleni
1 sibling, 1 reply; 104+ messages in thread
From: David Brown @ 2013-11-19 10:16 UTC (permalink / raw)
To: H. Peter Anvin, Andrea Mazzoleni; +Cc: linux-raid, linux-btrfs, creamyfish
Hi all,
A while back I worked through the maths for a method of extending raid
to multiple parities, though I never got as far as implementing it in
code (other than some simple Python test code to confirm the maths). It
is also missing the maths for simplified ways to recover data. I've
posted a couple of times with this on the linux-raid mailing list (as
linked in this thread) - there has certainly been some interest, but
it's not easy to turn interest into hard work!
I used an obvious expansion on the existing RAID5 and RAID6 algorithms,
with parity P_n being generated from powers of 2^n. This means that the
triple-parity version can be implemented by simply applying the RAID6
operations twice. For a triple parity, this works well - the matrices
involved are all invertible up to 255 data disks. Beyond that, however,
things drop off rapidly - quad parity implemented in the same way only
supports 21 data disks, and for five parity disks you need to use 0x20
(skipping 0x10) to get even 8 data disks.
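For concreteness, here is a byte-at-a-time sketch of that scheme (P, Q,
R with coefficients 1, 2^i and 4^i), using Horner's rule so the inner
loop never needs anything more than a multiply-by-two - purely
illustrative, not tuned code:

/*
 * Byte-wise sketch of "powers of 2^n" triple parity: P = sum d_i,
 * Q = sum 2^i d_i, R = sum 4^i d_i over GF(2^8) with polynomial 0x11d.
 * Horner's rule keeps the inner loop down to multiply-by-two steps.
 */
#include <stdint.h>
#include <stddef.h>

static uint8_t gf_mul2(uint8_t v)
{
    return (uint8_t)((v << 1) ^ ((v & 0x80) ? 0x1d : 0));
}

void gen_pqr(uint8_t **data, int ndisks, size_t size,
             uint8_t *p, uint8_t *q, uint8_t *r)
{
    for (size_t i = 0; i < size; ++i) {
        /* start from the highest disk and fold the others in */
        uint8_t wp = data[ndisks - 1][i];
        uint8_t wq = wp;
        uint8_t wr = wp;
        for (int d = ndisks - 2; d >= 0; --d) {
            wp ^= data[d][i];
            wq = gf_mul2(wq) ^ data[d][i];
            wr = gf_mul2(gf_mul2(wr)) ^ data[d][i];
        }
        p[i] = wp;
        q[i] = wq;
        r[i] = wr;
    }
}

Note the R update is just the Q update applied twice, which is what I
mean by reusing the existing RAID6 operation.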
This means that my method would be fine for triple parity, and would
also be efficient in implementation.
Beyond triple parity, the simple method has size limits for four parity
and is no use on anything bigger. The Cauchy matrix method lets us go
beyond that (I haven't yet studied your code and your maths - I will do
so as soon as I have the chance, but I doubt if that will be before the
weekend).
Would it be possible to use the simple parity system for the first three
parities, and Cauchy beyond that? That would give the best of both worlds.
The important thing to think about here is what would actually be useful
in the real world. It is always nice to have a system that can make an
array with 251 data disks and 6 parities (and I certainly think the
maths involved is fun), but would anyone use such a beast?
Triple parity has clear use cases. As people have moved up from raid5
to raid6, "raid7" or "raid6-3p" would be an obvious next step. I also
see it as being useful for maintenance on raid6 arrays - if you want to
replace disks on a raid6 array you could first add a third parity disk
with an asymmetric layout, then you could replace the main disks while
keeping two disk redundancy at all times.
Quad parity is unlikely, I think - you would need a very wide array and
unusual requirements to make quad parity a better choice than a layered
system of raid10 or raid15. At most, I think it would find use as a
temporary security while maintaining a triple-raid array. Remember also
that such an array would be painfully slow if it ever needed to rebuild
data with four missing disks - and if it is then too slow to be usable,
then quad parity is not a useful solution.
(Obviously anyone with /real/ experience with large arrays can give
better ideas here - I like the maths of multi-parity raid, but I will
not use it for my small arrays.)
Of course I will enjoy studying your maths here, and I'll try to give
some feedback on it. But I think for implementation purposes, the
simple "powers of 4" generation of triple parity would be better than
using the Cauchy matrix - it is a clear step from the existing raid6,
and it can work fast on a wide variety of processors (people use ARMs
and other "small" cpus on raids, not just x86 with SSE3). I believe
that would mean simpler code and fewer changes, which is always popular
with the kernel folk.
However, if it is not possible to use Cauchy matrices to get four and
more parity while keeping the same first three parities, then the
balance changes and a decision needs to be made - do we (the Linux
kernel developers, the btrfs developers, and the users) want a simpler
system that is limited to triple parity (or quad parity with 21 + 4
disks), or do we want a more complex but more flexible system?
Personally, I don't mind either way, as long as we get a good technical
solution. And I'll do what I can to help with the maths in either case.
David
* Re: Triple parity and beyond
2013-11-18 23:25 ` H. Peter Anvin
2013-11-19 10:16 ` David Brown
@ 2013-11-19 17:28 ` Andrea Mazzoleni
2013-11-19 20:29 ` Ric Wheeler
1 sibling, 1 reply; 104+ messages in thread
From: Andrea Mazzoleni @ 2013-11-19 17:28 UTC (permalink / raw)
To: H. Peter Anvin; +Cc: linux-raid, linux-btrfs, David Brown, David Smith
Hi Peter,
Yes, 251 data disks for 6 parities.
To build an NxM Cauchy matrix you need to pick N+M distinct values
in GF(2^8), and we have only 2^8 == 256 available.
This means that for every row we add for an extra parity level, we
have to remove one of the disk columns.
Note that in truth I use an Extended Cauchy matrix that gives the
first row of 1s for "free". This results in N+M <= 256 + 1.
So DISKS = 257 - PARITY -> 251 = 257 - 6
A brief introduction of Cauchy and Extended Cauchy matrix can be found in:
Vinocha, "On Generator Cauchy Matrices of GDRS/GTRS Codes", 2012
http://www.m-hikari.com/ijcms/ijcms-2012/45-48-2012/brarIJCMS45-48-2012.pdf
(just check the Introduction, the rest is not related)
More details can be found in:
Roth, "Introduction to Coding Theory", 2006
http://carlossicoli.free.fr/R/Roth_R.-Introduction_to_coding_theory-Cambridge_University_Press%282006%29.pdf
(search for "Extended Cauchy")
Ciao,
Andrea
* Re: Triple parity and beyond
2013-11-19 10:16 ` David Brown
@ 2013-11-19 17:36 ` Andrea Mazzoleni
2013-11-19 22:51 ` Drew
2013-11-20 10:31 ` David Brown
0 siblings, 2 replies; 104+ messages in thread
From: Andrea Mazzoleni @ 2013-11-19 17:36 UTC (permalink / raw)
To: David Brown; +Cc: H. Peter Anvin, linux-raid, linux-btrfs, David Smith
Hi David,
Just to say that I know your good past work, and it helped me a lot.
Thanks for that!
Unfortunately the Cauchy matrix is not compatible with a triple parity
implementation using power coefficients. They are different and
incompatible roads.
I partially agree on your considerations, and in fact in my sources
you can also see an alternate triple parity implementation using powers
of 2^-1 == 1/2 == 0x8e, intended for CPUs not supporting PSHUFB.
This is faster than using powers of 2^2 == 4, because we can divide
by 2 as fast as we can multiply by 2.
The choice of ZFS to use powers of 4 was likely not optimal,
because to multiply by 4, it has to do two multiplications by 2.
Also this method doesn't work for quad parity, because it fails with
more than 16 data disks.
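To make the point concrete, both primitives are just a shift plus a
conditional XOR (byte-wise sketch; the optimized code applies the same
idea to whole SSE2 words):

/*
 * Multiply and divide by 2 in GF(2^8) with polynomial 0x11d: both are a
 * shift plus a conditional XOR, so a third parity built on powers of
 * 2^-1 (0x8e) costs the same per step as the RAID-6 powers of 2.
 */
#include <stdint.h>

static uint8_t gf_mul2(uint8_t v)
{
    return (uint8_t)((v << 1) ^ ((v & 0x80) ? 0x1d : 0));
}

static uint8_t gf_div2(uint8_t v)
{
    return (uint8_t)((v >> 1) ^ ((v & 0x01) ? 0x8e : 0));
}

gf_div2() is the exact inverse of gf_mul2(), so the 2^-1 road needs
nothing beyond what RAID-6 already has.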
What I tend not to agree with is giving too much importance to low-end
architectures that don't support PSHUFB or a similar instruction.
Such architectures can just stay with two parity levels.
Consider that for fast recovery (running in degraded mode) you anyway
need PSHUFB to get acceptable performance.
In my system I can generate triple parity at 10GB/s using SSE2,
but recover only at 100MB/s without SSSE3 PSHUFB. It's a slowdown
of x100! With PSHUFB it's a bit better and I can recover at 500MB/s.
Note also that the ARM NEON architecture introduced the VTBL
instruction, and AMD introduced VPPERM, that could be used like
PSHUFB.
From the complexity point of view, I don't see any difference between
the two methods.
They are just two matrices with different coefficients sharing the same
recovery functions. The only difference is in the optimized parity
generation, which uses SSSE3 instead of SSE2.
Anyway, I cannot tell what is the best option for Linux RAID and Btrfs.
There are surely better qualified people on this list to say that.
I can just say that systems using multiple parity levels do exist, and
maybe the Linux kernel could also benefit from this kind of support.
Here are some examples:
Oracle/Sun, Dell/Compellent ZFS: 3 parity drives
NEC HydraStor: 3 parity drives
EMC/Isilon: 4 parity drives
Amplidata: 4 parity drives
CleverSafe: 6 parity drives
StreamScale/BigParity: 7 parity drives
And Btrfs with six parities would surely be cool :)
Ciao,
Andrea
* Re: Triple parity and beyond
2013-11-18 22:08 Andrea Mazzoleni
2013-11-18 22:12 ` H. Peter Anvin
@ 2013-11-19 18:12 ` Piergiorgio Sartor
2013-11-20 10:44 ` David Brown
2013-11-20 21:38 ` Andrea Mazzoleni
2013-11-20 22:29 ` Piergiorgio Sartor
2 siblings, 2 replies; 104+ messages in thread
From: Piergiorgio Sartor @ 2013-11-19 18:12 UTC (permalink / raw)
To: Andrea Mazzoleni; +Cc: linux-raid, linux-btrfs, hpa, david.brown, creamyfish
Hi Andrea,
great job, this was exactly what I was looking for.
Do you know if there is a "fast" way not to correct
errors, but to find them?
In RAID-6 (as per raid6check) there is an easy way
to verify where an HDD has incorrect data.
I suspect that for every 2 parity blocks it should be
possible to find 1 error (and if this is true, then
quad parity is more attractive than the triple one).
Furthermore, my second (or first) target would
be something like: http://www.symform.com/blog/tag/raid-96/
Which uses 32 parities (out of 96 "disks").
Keep going!!!
bye,
pg
--
piergiorgio
* Re: Triple parity and beyond
2013-11-19 17:28 ` Andrea Mazzoleni
@ 2013-11-19 20:29 ` Ric Wheeler
2013-11-20 16:16 ` James Plank
0 siblings, 1 reply; 104+ messages in thread
From: Ric Wheeler @ 2013-11-19 20:29 UTC (permalink / raw)
To: Andrea Mazzoleni, H. Peter Anvin
Cc: linux-raid, linux-btrfs, David Brown, David Smith, James Plank
Great work - we have waited a long time for this. Adding in Jim Plank who did
some great talks and work in this area as well :)
Ric
* Re: Triple parity and beyond
2013-11-19 17:36 ` Andrea Mazzoleni
@ 2013-11-19 22:51 ` Drew
2013-11-20 0:54 ` Chris Murphy
2013-11-20 10:31 ` David Brown
1 sibling, 1 reply; 104+ messages in thread
From: Drew @ 2013-11-19 22:51 UTC (permalink / raw)
To: Andrea Mazzoleni
Cc: David Brown, H. Peter Anvin, Linux RAID Mailing List, linux-btrfs,
David Smith
I'm not going to claim any expert status on this discussion (the
theory makes my head spin) but I will say I agree with Andrea as far
as preferring his implementation for triple parity and beyond.
PSHUFB has been around on the Intel platform since the Core 2
introduced it as part of SSSE3 back in 2006. The generation of
Intel-based servers that ran pre-Core Xeons is long in the tooth, and
this is a value judgement, but if your data is big enough to need
triple parity, you probably shouldn't be running it on a ten-year-old
platform.
But that's just me. :-)
--
Drew
"Nothing in life is to be feared. It is only to be understood."
--Marie Curie
"This started out as a hobby and spun horribly out of control."
-Unknown
* Re: Triple parity and beyond
2013-11-19 22:51 ` Drew
@ 2013-11-20 0:54 ` Chris Murphy
2013-11-20 1:23 ` John Williams
0 siblings, 1 reply; 104+ messages in thread
From: Chris Murphy @ 2013-11-20 0:54 UTC (permalink / raw)
To: Linux RAID Mailing List; +Cc: Btrfs BTRFS
If anything, I'd like to see two implementations of RAID 6 dual parity. The existing implementation in the md driver and btrfs could remain the default, but users could opt into Cauchy matrix based dual parity which would then enable them an easy (and live) migration path to triple parity and beyond.
Chris Murphy
* Re: Triple parity and beyond
2013-11-20 0:54 ` Chris Murphy
@ 2013-11-20 1:23 ` John Williams
2013-11-20 10:35 ` David Brown
0 siblings, 1 reply; 104+ messages in thread
From: John Williams @ 2013-11-20 1:23 UTC (permalink / raw)
To: Chris Murphy; +Cc: Linux RAID Mailing List, Btrfs BTRFS
Actually, my understanding is that Andrea's Cauchy matrix technique
(call it C) is compatible with existing md RAID5 and RAID6 (call these
A). It is only the non-SSSE3 triple-parity algorithm 2^-1 (call it B)
that is incompatible with his Cauchy matrix technique.
So, you can have:
1) A+B
or
2) A+C
But you cannot have A+B+C
* Re: Triple parity and beyond
2013-11-19 17:36 ` Andrea Mazzoleni
2013-11-19 22:51 ` Drew
@ 2013-11-20 10:31 ` David Brown
2013-11-20 18:09 ` John Williams
2013-11-20 18:34 ` Andrea Mazzoleni
1 sibling, 2 replies; 104+ messages in thread
From: David Brown @ 2013-11-20 10:31 UTC (permalink / raw)
To: Andrea Mazzoleni; +Cc: H. Peter Anvin, linux-raid, linux-btrfs, David Smith
On 19/11/13 18:36, Andrea Mazzoleni wrote:
> Hi David,
>
> Just to say that I know your good past work, and it helped me a lot.
> Thanks for that!
If we end up with your Cauchy matrix implementation going into the
kernel and btrfs (and you've persuaded me, anyway), then perhaps I
should make a new version of that document with the Cauchy matrix maths.
I know I have found Peter's original Raid 6 document very useful and
informative, and I think it would be good to have an equivalent here
too. A lot of people find the maths of Raid 6 hard to grasp - it
doesn't get easier with triple parity, and such documentation could help
future kernel/btrfs developers.
>
> Unfortunately the Cauchy matrix is not compatible with a triple parity
> implementation using power coefficients. They are different and
> incompatible roads.
Yes. I haven't worked through all the details of how Cauchy matrices
work - but I have read enough to see that this is correct.
I am sure that it is /possible/ to make matrices that have the first
three rows matching the power coefficient triple parity with at least 6
rows in total and a useful horizontal size that have the required
property (all square submatrices are invertible) - but I have no idea
how to find such matrices in a sensible time frame. The key point with
the Cauchy matrix is not the particular form of the coefficients, but
that it gives a way to generate coefficients easily.
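To make that concrete: checking any single candidate submatrix is cheap
- something like the sketch below, with the gf_mul() from the sketch
earlier in the thread - the problem is the combinatorial number of
row/column choices you would have to try while searching for such a
matrix.

/*
 * Sketch of the brute-force style of check Andrea mentioned: verify
 * that one square submatrix (rows = chosen parities, columns = chosen
 * failed disks) is non-singular, by Gaussian elimination over GF(2^8).
 * A full verification repeats this for every such choice.
 */
#include <stdint.h>

uint8_t gf_mul(uint8_t a, uint8_t b);   /* from the earlier sketch */

/* returns 1 if the k x k matrix m (row-major, clobbered) is
 * non-singular; cross-multiplying rows avoids any need for division */
int is_nonsingular(uint8_t *m, int k)
{
    for (int col = 0; col < k; ++col) {
        int piv = col;
        while (piv < k && m[piv * k + col] == 0)
            ++piv;
        if (piv == k)
            return 0;   /* no pivot in this column: singular */
        if (piv != col)
            for (int j = 0; j < k; ++j) {
                uint8_t t = m[col * k + j];
                m[col * k + j] = m[piv * k + j];
                m[piv * k + j] = t;
            }
        for (int row = col + 1; row < k; ++row) {
            uint8_t f = m[row * k + col];
            if (f == 0)
                continue;
            for (int j = col; j < k; ++j)
                m[row * k + j] = gf_mul(m[row * k + j], m[col * k + col])
                               ^ gf_mul(m[col * k + j], f);
        }
    }
    return 1;
}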
>
> I partially agree on your considerations, and in fact in my sources
> you can also see an alternate triple parity implementation using powers
> of 2^-1 == 1/2 == 0x8e, intended for CPUs not supporting PSHUFB.
> This is faster than using powers of 2^2 == 4, because we can divide
> by 2 as fast as we can multiply by 2.
> The choice of ZFS to use powers of 4 was likely not optimal,
> because to multiply by 4, it has to do two multiplications by 2.
I can agree with that. I didn't copy ZFS's choice here (I knew that ZFS
/had/ triple parity, but I think my suggestion is not exactly the same)
- I just reached the same conclusion that since we have an operation
that works fast, doing it twice is not going to be an unreasonable cost.
It hadn't occurred to me that dividing by 2 was equally easy.
> Also this method doesn't work for quad parity, because it fails with
> more than 16 data disks.
Indeed.
>
> What I tend not to agree with is giving too much importance to low-end
> architectures that don't support PSHUFB or a similar instruction.
> Such architectures can just stay with two parity levels.
That's certainly a reasonable way to look at it. We should not limit
the possibilities for high-end systems because of the limitations of
low-end systems that are unlikely to use 3+ parity anyway. I've also
looked up a list of the processors that support SSE3 and PSHUFB - a lot
of modern "low-end" x86 cpus support it. And of course it is possible
to implement general G(2^8) multiplication without PSHUFB, using a
lookup table - it is important that this can all work with any CPU, even
if it is slow.
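For example, the classic log/exp table pair gives a general GF(2^8)
multiply on any CPU - a small sketch:

/*
 * Classic table-based GF(2^8) multiply, usable on any CPU: log/exp
 * tables built from 2, which is a generator of the field with
 * polynomial 0x11d.  gf_tables_init() must be called once first.
 */
#include <stdint.h>

static uint8_t gf_log[256];
static uint8_t gf_exp[512];     /* doubled so no modulo is needed */

void gf_tables_init(void)
{
    uint8_t x = 1;
    for (int i = 0; i < 255; ++i) {
        gf_exp[i] = x;
        gf_exp[i + 255] = x;
        gf_log[x] = (uint8_t)i;
        x = (uint8_t)((x << 1) ^ ((x & 0x80) ? 0x1d : 0));  /* x *= 2 */
    }
}

uint8_t gf_mul_table(uint8_t a, uint8_t b)
{
    if (a == 0 || b == 0)
        return 0;
    return gf_exp[gf_log[a] + gf_log[b]];
}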
> Consider that for fast recovery (running in degraded mode) you anyway
> need PSHUFB to get acceptable performance.
I don't think we need to be very concerned with fast recovery using the
third (or more) parity. If one disk (or sector on a disk) has failed,
recovery is done with the RAID5 parity block. If we have two failures,
we can use the RAID6 procedure (I haven't looked at the implementation
for this at the moment, or for your PSHUFB code - but since RAID6
recovery also involves general multiplication, there may be scope for
speedups here). It is only with three simultaneous failures that we
need to use additional parities - such situations should be very rare.
Typically this would be only for stripes with unrecoverable read errors
found while rebuilding for another missed disk or two.
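For reference, the general recovery step would look roughly like the
sketch below - gf_mul() as in the earlier sketch, and
gf_invert_matrix() standing in for a Gauss-Jordan inversion over
GF(2^8) that is not shown here:

/*
 * Sketch of generic recovery: with k lost data disks and k usable
 * parities, invert the k x k submatrix of the generator matrix once,
 * then rebuild the lost bytes from the syndromes.
 */
#include <stdint.h>
#include <stddef.h>

uint8_t gf_mul(uint8_t a, uint8_t b);                   /* earlier sketch */
int gf_invert_matrix(const uint8_t *m, uint8_t *inv, int k);  /* not shown */

/*
 * A: nparity x ndata generator matrix (row-major).
 * lost[]: indices of the k failed data disks; rows[]: the k parities used.
 * data[d] == NULL marks a failed disk; its content goes to rebuilt[].
 */
int recover(const uint8_t *A, int ndata,
            const int *lost, const int *rows, int k,
            uint8_t **data, uint8_t **parity, uint8_t **rebuilt, size_t size)
{
    uint8_t M[6 * 6], Minv[6 * 6], syn[6];  /* up to 6 parities here */

    /* submatrix: chosen parity rows restricted to the lost columns */
    for (int r = 0; r < k; ++r)
        for (int c = 0; c < k; ++c)
            M[r * k + c] = A[rows[r] * ndata + lost[c]];
    if (!gf_invert_matrix(M, Minv, k))
        return -1;      /* cannot happen with a Cauchy matrix */

    for (size_t i = 0; i < size; ++i) {
        /* syndrome: parity minus the surviving disks' contribution */
        for (int r = 0; r < k; ++r) {
            uint8_t s = parity[rows[r]][i];
            for (int d = 0; d < ndata; ++d)
                if (data[d])
                    s ^= gf_mul(A[rows[r] * ndata + d], data[d][i]);
            syn[r] = s;
        }
        /* lost bytes = Minv * syndrome */
        for (int c = 0; c < k; ++c) {
            uint8_t v = 0;
            for (int r = 0; r < k; ++r)
                v ^= gf_mul(Minv[c * k + r], syn[r]);
            rebuilt[c][i] = v;
        }
    }
    return 0;
}

The nice part is that the matrix inversion happens once per failure
pattern, not per byte, so the per-byte work is just the same kind of
multiply-and-XOR as parity generation.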
Of course, it is always nice to have fast recovery - but fast generation
of parities is far more important.
> In my system I can generate triple parity at 10GB/s using SSE2,
> but recover only at 100MB/s without SSSE3 PSHUFB. It's a slowdown
> of x100! With PSHUFB it's a bit better and I can recover at 500MB/s.
The difference here is bigger than I would have guessed, but I haven't
looked at the code yet.
> Note also that the ARM NEON architecture introduced the VTBL
> instruction, and AMD introduced VPPERM, that could be used like
> PSHUFB.
Sounds good - at least it will be fun trying to figure out optimal code.
>
> From the complexity point of view, I don't see any difference between
> the two methods.
> They are just two matrices with different coefficients sharing the same
> recovery functions. The only difference is in the optimized parity
> generation, which uses SSSE3 instead of SSE2.
Recovery is the same, yes. It is only the parity generation that is
different - multiplying by powers of 2 means each step is a fast
multiply-by-two, with Horner's rule to avoid any other multiplication.
With parity 3 generated as powers of 4 or 2^-1, you have the same system
with only a slightly slower multiply-by-4 step. With the Cauchy matrix,
you need general multiplication with different coefficients for each
disk block. This is significantly more complex - but if it can be done
fast enough on at least a reasonable selection of processors, it's okay
to be complex.
>
> Anyway, I cannot tell what is the best option for Linux RAID and Btrfs.
> There are surely better qualified people on this list to say that.
I think H. Peter Anvin is the best qualified for such decisions - I
believe he has the most experience and understanding in this area.
For what it is worth, you have convinced /me/ that your Cauchy matrices
are the way to go. I will want to study your code a bit more, and try
it out for myself, but it looks like you have a way to overcome the
limitations of the power sequence method without too big runtime costs -
and that is exactly what we need.
> I can just say that systems using multiple parity levels do exist, and
> maybe the Linux kernel could also benefit from this kind of support.
I certainly think so. I think 3 parities is definitely useful, and
sometimes four would be nice. Beyond that, I suspect "coolness" and
bragging rights (btrfs can support more parities than ZFS...) will
outweigh real-life implementations, so it is important that the
implementation does not sacrifice anything on the triple parity in order
to get 5+ parity support. It's fine for /us/ to look at fun solutions,
but it needs to be practical too if it is going to be accepted in the
kernel.
mvh.,
David
* Re: Triple parity and beyond
2013-11-20 1:23 ` John Williams
@ 2013-11-20 10:35 ` David Brown
0 siblings, 0 replies; 104+ messages in thread
From: David Brown @ 2013-11-20 10:35 UTC (permalink / raw)
To: John Williams, Chris Murphy; +Cc: Linux RAID Mailing List, Btrfs BTRFS
On 20/11/13 02:23, John Williams wrote:
> On Tue, Nov 19, 2013 at 4:54 PM, Chris Murphy <lists@colorremedies.com>
> wrote:
>> If anything, I'd like to see two implementations of RAID 6 dual
>> parity. The existing implementation in the md driver and btrfs could
>> remain the default, but users could opt into Cauchy matrix based dual
>> parity which would then enable them an easy (and live) migration path
>> to triple parity and beyond.
Andrea's Cauchy matrix is compatible with the existing Raid6, so there
is no problem there.
I believe it would be a terrible idea to have an incompatible extension
- that would mean you could not have temporary extra parity drives with
asymmetrical layouts, which is something I see as a very useful feature.
>
> Actually, my understanding is that Andrea's Cauchy matrix technique
> (call it C) is compatible with existing md RAID5 and RAID6 (call these
> A). It is only the non-SSSE3 triple-parity algorithm 2^-1 (call it B)
> that is incompatible with his Cauchy matrix technique.
>
> So, you can have:
>
> 1) A+B
>
> or
>
> 2) A+C
>
> But you cannot have A+B+C
Yes, that's right.
* Re: Triple parity and beyond
2013-11-19 18:12 ` Piergiorgio Sartor
@ 2013-11-20 10:44 ` David Brown
2013-11-20 21:59 ` Piergiorgio Sartor
2013-11-20 21:38 ` Andrea Mazzoleni
1 sibling, 1 reply; 104+ messages in thread
From: David Brown @ 2013-11-20 10:44 UTC (permalink / raw)
To: Piergiorgio Sartor, Andrea Mazzoleni
Cc: linux-raid, linux-btrfs, hpa, creamyfish
On 19/11/13 19:12, Piergiorgio Sartor wrote:
> On Mon, Nov 18, 2013 at 11:08:59PM +0100, Andrea Mazzoleni wrote:
<snip for brevity>
>
> Hi Andrea,
>
> great job, this was exactly what I was looking for.
>
> Do you know if there is a "fast" way not to correct
> errors, but to find them?
>
> In RAID-6 (as per raid6check) there is an easy way
> to verify where an HDD has incorrect data.
>
I think the way to do that is just to generate the parity blocks from
the data blocks, and compare them to the existing parity blocks.
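Roughly like the sketch below (only an illustration, reusing the
gen_parity() from the sketch earlier in the thread - not what
raid6check actually does):

/*
 * Sketch of a "recompute and compare" scrub pass: regenerate every
 * parity from the data and report which parity rows mismatch.
 * Locating which *data* disk is bad needs the extra syndrome maths
 * that raid6check does for RAID-6.
 */
#include <stdint.h>
#include <stddef.h>
#include <string.h>

void gen_parity(const uint8_t *A, int nparity, int ndata,
                uint8_t **data, uint8_t **parity, size_t size);

int check_stripe(const uint8_t *A, int nparity, int ndata,
                 uint8_t **data, uint8_t **parity,
                 uint8_t **scratch, size_t size)
{
    int bad = 0;

    gen_parity(A, nparity, ndata, data, scratch, size);
    for (int p = 0; p < nparity; ++p)
        if (memcmp(parity[p], scratch[p], size) != 0)
            bad |= 1 << p;      /* bit p set: parity p mismatches */
    return bad;
}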
> I suspect that for every 2 parity blocks it should be
> possible to find 1 error (and if this is true, then
> quad parity is more attractive than the triple one).
>
> Furthermore, my second (or first) target would
> be something like: http://www.symform.com/blog/tag/raid-96/
>
> Which uses 32 parities (out of 96 "disks").
I believe Andrea's matrix is extensible as long as you have no more than
257 disks in total. A mere 32 parities should not be a problem :-)
mvh.,
David
* Re: Triple parity and beyond
2013-11-19 20:29 ` Ric Wheeler
@ 2013-11-20 16:16 ` James Plank
2013-11-20 19:05 ` Andrea Mazzoleni
2013-11-21 1:28 ` Stan Hoeppner
0 siblings, 2 replies; 104+ messages in thread
From: James Plank @ 2013-11-20 16:16 UTC (permalink / raw)
To: Ric Wheeler
Cc: Andrea Mazzoleni, H. Peter Anvin, linux-raid, linux-btrfs,
David Brown, David Smith
Hi all -- no real comments, except as I mentioned to Ric, my tutorial in FAST last February presents Reed-Solomon coding with Cauchy matrices, and then makes special note of the common pitfall of assuming that you can append a Vandermonde matrix to an identity matrix. Please see http://web.eecs.utk.edu/~plank/plank/papers/2013-02-11-FAST-Tutorial.pdf, slides 48-52.
Andrea, does the matrix that you included in an earlier mail (the one that has Linux RAID-6 in the first two rows) have a general form, or did you develop it in an ad hoc manner so that it would include Linux RAID-6 in the first two rows?
Best wishes -- Jim
* Re: Triple parity and beyond
2013-11-20 10:31 ` David Brown
@ 2013-11-20 18:09 ` John Williams
2013-11-20 18:44 ` Andrea Mazzoleni
2013-11-21 8:32 ` David Brown
2013-11-20 18:34 ` Andrea Mazzoleni
1 sibling, 2 replies; 104+ messages in thread
From: John Williams @ 2013-11-20 18:09 UTC (permalink / raw)
To: David Brown
Cc: Andrea Mazzoleni, H. Peter Anvin, Linux RAID Mailing List,
Btrfs BTRFS, David Smith
On Wed, Nov 20, 2013 at 2:31 AM, David Brown <david.brown@hesbynett.no> wrote:
> That's certainly a reasonable way to look at it. We should not limit
> the possibilities for high-end systems because of the limitations of
> low-end systems that are unlikely to use 3+ parity anyway. I've also
> looked up a list of the processors that support SSE3 and PSHUFB - a lot
> of modern "low-end" x86 cpus support it. And of course it is possible
> to implement general G(2^8) multiplication without PSHUFB, using a
> lookup table - it is important that this can all work with any CPU, even
> if it is slow.
Unfortunately, it is SSSE3 that is required for PSHUFB. The SSE3 set
with only two-esses does not suffice. I made that same mistake when I
first heard about Andrea's 6-parity work. SSSE3 vs. SSE3, confusing
notation!
SSSE3 is significantly less widely supported than SSE3. Particularly
on AMD, only the very latest CPUs seem to support SSSE3. Intel support
for SSSE3 goes back much further than AMD support.
Maybe it is not such a big problem, since it may be possible to
support two "roads". Both roads would include the current md RAID-5
and RAID-6. But one road, which those lacking CPUs supporting SSSE3
might choose, would continue on to the non-SSSE3 triple-parity 2^-1
technique, and then dead-end. The other road would continue with the
Cauchy matrix technique through 3-parity all the way to 6-parity.
It might even be feasible to allow someone stuck at the end of the
non-SSSE3 road to convert to the Cauchy road. You would have to go
through all the 2^-1 triple-parity and convert it to Cauchy
triple-parity. But then you would be safely on the Cauchy road.
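Picking the road at run time is cheap, by the way. As a rough sketch
(userspace flavour, with a recent GCC; a real implementation would
dispatch to the corresponding parity routines instead of printing):

#include <stdio.h>

int main(void)
{
    __builtin_cpu_init();   /* harmless here; only required in constructors */
    if (__builtin_cpu_supports("ssse3"))
        printf("SSSE3 present: the PSHUFB-based parity code can be used\n");
    else
        printf("no SSSE3: fall back to table-based GF(2^8) multiplies\n");
    return 0;
}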
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-20 10:31 ` David Brown
2013-11-20 18:09 ` John Williams
@ 2013-11-20 18:34 ` Andrea Mazzoleni
2013-11-20 18:43 ` H. Peter Anvin
2013-11-21 8:36 ` David Brown
1 sibling, 2 replies; 104+ messages in thread
From: Andrea Mazzoleni @ 2013-11-20 18:34 UTC (permalink / raw)
To: David Brown; +Cc: H. Peter Anvin, linux-raid, linux-btrfs, David Smith
Hi David,
>> The choice of ZFS to use powers of 4 was likely not optimal,
>> because to multiply by 4, it has to do two multiplications by 2.
> I can agree with that. I didn't copy ZFS's choice here
David, it was not my intention to suggest that you copied from ZFS.
Sorry to have expressed myself badly. I just mentioned ZFS because it's
an implementation that I know uses powers of 4 to generate triple
parity, and I saw in the code that it's implemented with two multiplications
by 2.
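Per byte, that step is simply the following (a sketch using the usual
0x11d polynomial, not the actual ZFS code):

#include <stdint.h>

/* multiply by 2 in GF(2^8), reducing by 0x11d when the top bit is set */
static inline uint8_t gf_mul2(uint8_t v)
{
    return (v << 1) ^ ((v & 0x80) ? 0x1d : 0x00);
}

/* multiply by 4 = two multiplications by 2 */
static inline uint8_t gf_mul4(uint8_t v)
{
    return gf_mul2(gf_mul2(v));
}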
Ciao,
Andrea
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-20 18:34 ` Andrea Mazzoleni
@ 2013-11-20 18:43 ` H. Peter Anvin
2013-11-20 18:56 ` Andrea Mazzoleni
2013-11-21 8:36 ` David Brown
1 sibling, 1 reply; 104+ messages in thread
From: H. Peter Anvin @ 2013-11-20 18:43 UTC (permalink / raw)
To: Andrea Mazzoleni, David Brown; +Cc: linux-raid, linux-btrfs, David Smith
It is also possible to quickly multiply by 2^-1 which makes for an interesting R parity.
Andrea Mazzoleni <amadvance@gmail.com> wrote:
>Hi David,
>
>>> The choice of ZFS to use powers of 4 was likely not optimal,
>>> because to multiply by 4, it has to do two multiplications by 2.
>> I can agree with that. I didn't copy ZFS's choice here
>David, it was not my intention to suggest that you copied from ZFS.
>Sorry to have expressed myself badly. I just mentioned ZFS because it's
>an implementation that I know uses powers of 4 to generate triple
>parity, and I saw in the code that it's implemented with two
>multiplication
>by 2.
>
>Ciao,
>Andrea
--
Sent from my mobile phone. Please pardon brevity and lack of formatting.
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-20 18:09 ` John Williams
@ 2013-11-20 18:44 ` Andrea Mazzoleni
2013-11-21 6:15 ` Stan Hoeppner
2013-11-21 8:32 ` David Brown
1 sibling, 1 reply; 104+ messages in thread
From: Andrea Mazzoleni @ 2013-11-20 18:44 UTC (permalink / raw)
To: John Williams
Cc: David Brown, H. Peter Anvin, Linux RAID Mailing List, Btrfs BTRFS,
David Smith
Hi John,
Yes. There are still AMD CPUs sold without SSSE3. Most notably Athlon.
Instead, Intel is providing SSSE3 from the Core 2 Duo.
A detailed list is available at: http://en.wikipedia.org/wiki/SSSE3
Ciao,
Andrea
On Wed, Nov 20, 2013 at 7:09 PM, John Williams <jwilliams4200@gmail.com> wrote:
> On Wed, Nov 20, 2013 at 2:31 AM, David Brown <david.brown@hesbynett.no> wrote:
>> That's certainly a reasonable way to look at it. We should not limit
>> the possibilities for high-end systems because of the limitations of
>> low-end systems that are unlikely to use 3+ parity anyway. I've also
>> looked up a list of the processors that support SSE3 and PSHUFB - a lot
>> of modern "low-end" x86 cpus support it. And of course it is possible
>> to implement general G(2^8) multiplication without PSHUFB, using a
>> lookup table - it is important that this can all work with any CPU, even
>> if it is slow.
>
> Unfortunately, it is SSSE3 that is required for PSHUFB. The SSE3 set
> with only two-esses does not suffice. I made that same mistake when I
> first heard about Andrea's 6-parity work. SSSE3 vs. SSE3, confusing
> notation!
>
> SSSE3 is significantly less widely supported than SSE3. Particularly
> on AMD, only the very latest CPUs seem to support SSSE3. Intel support
> for SSSE3 goes back much further than AMD support.
>
> Maybe it is not such a big problem, since it may be possible to
> support two "roads". Both roads would include the current md RAID-5
> and RAID-6. But one road, which those lacking CPUs supporting SSSE3
> might choose, would continue on to the non-SSSE3 triple-parity 2^-1
> technique, and then dead-end. The other road would continue with the
> Cauchy matrix technique through 3-parity all the way to 6-parity.
>
> It might even be feasible to allow someone stuck at the end of the
> non-SSSE3 road to convert to the Cauchy road. You would have to go
> through all the 2^-1 triple-parity and convert it to Cauchy
> triple-parity. But then you would be safely on the Cauchy road.
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-20 18:43 ` H. Peter Anvin
@ 2013-11-20 18:56 ` Andrea Mazzoleni
2013-11-20 18:59 ` H. Peter Anvin
2013-11-20 19:00 ` H. Peter Anvin
0 siblings, 2 replies; 104+ messages in thread
From: Andrea Mazzoleni @ 2013-11-20 18:56 UTC (permalink / raw)
To: H. Peter Anvin
Cc: David Brown, Linux RAID Mailing List, Btrfs BTRFS, David Smith
Hi,
Yep. At present, to multiply by 2^-1, I'm using this in C:
static inline uint64_t d2_64(uint64_t v)
{
    uint64_t mask = v & 0x0101010101010101U;
    mask = (mask << 8) - mask;
    v = (v >> 1) & 0x7f7f7f7f7f7f7f7fU;
    v ^= mask & 0x8e8e8e8e8e8e8e8eU;
    return v;
}
and for SSE2:
asm volatile("movdqa %xmm2,%xmm4");
asm volatile("pxor %xmm5,%xmm5");
asm volatile("psllw $7,%xmm4");
asm volatile("psrlw $1,%xmm2");
asm volatile("pcmpgtb %xmm4,%xmm5");
asm volatile("pand %xmm6,%xmm2"); with xmm6 == 7f7f7f7f7f7f...
asm volatile("pand %xmm3,%xmm5"); with xmm3 == 8e8e8e8e8e...
asm volatile("pxor %xmm5,%xmm2");
where xmm2 is the input/output
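For clarity, on a single byte both versions compute just this (a scalar
reference, only to show what the code does per byte):

#include <stdint.h>

/* one byte of d2_64(): divide by 2 in GF(2^8) with the 0x11d polynomial,
 * i.e. multiply by 2^-1; note that 0x8e is 0x11d >> 1 */
static inline uint8_t d2_byte(uint8_t b)
{
    return (b >> 1) ^ ((b & 1) ? 0x8e : 0x00);
}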
Ciao,
Andrea
On Wed, Nov 20, 2013 at 7:43 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> It is also possible to quickly multiply by 2^-1 which makes for an interesting R parity.
>
> Andrea Mazzoleni <amadvance@gmail.com> wrote:
>>Hi David,
>>
>>>> The choice of ZFS to use powers of 4 was likely not optimal,
>>>> because to multiply by 4, it has to do two multiplications by 2.
>>> I can agree with that. I didn't copy ZFS's choice here
>>David, it was not my intention to suggest that you copied from ZFS.
>>Sorry to have expressed myself badly. I just mentioned ZFS because it's
>>an implementation that I know uses powers of 4 to generate triple
>>parity, and I saw in the code that it's implemented with two
>>multiplication
>>by 2.
>>
>>Ciao,
>>Andrea
>
> --
> Sent from my mobile phone. Please pardon brevity and lack of formatting.
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-20 18:56 ` Andrea Mazzoleni
@ 2013-11-20 18:59 ` H. Peter Anvin
2013-11-20 21:21 ` Andrea Mazzoleni
2013-11-20 19:00 ` H. Peter Anvin
1 sibling, 1 reply; 104+ messages in thread
From: H. Peter Anvin @ 2013-11-20 18:59 UTC (permalink / raw)
To: Andrea Mazzoleni
Cc: David Brown, Linux RAID Mailing List, Btrfs BTRFS, David Smith
On 11/20/2013 10:56 AM, Andrea Mazzoleni wrote:
> Hi,
>
> Yep. At present to multiply for 2^-1 I'm using in C:
>
> static inline uint64_t d2_64(uint64_t v)
> {
> uint64_t mask = v & 0x0101010101010101U;
> mask = (mask << 8) - mask;
> v = (v >> 1) & 0x7f7f7f7f7f7f7f7fU;
> v ^= mask & 0x8e8e8e8e8e8e8e8eU;
> return v;
> }
>
> and for SSE2:
>
> asm volatile("movdqa %xmm2,%xmm4");
> asm volatile("pxor %xmm5,%xmm5");
> asm volatile("psllw $7,%xmm4");
> asm volatile("psrlw $1,%xmm2");
> asm volatile("pcmpgtb %xmm4,%xmm5");
> asm volatile("pand %xmm6,%xmm2"); with xmm6 == 7f7f7f7f7f7f...
> asm volatile("pand %xmm3,%xmm5"); with xmm3 == 8e8e8e8e8e...
> asm volatile("pxor %xmm5,%xmm2");
>
> where xmm2 is the intput/output
>
Now, that doesn't sound like something that can get neatly meshed into
the Cauchy matrix scheme, I assume. It is somewhat nice to have a
scheme which is arbitrarily expandable without having to fall back to
dual parity during the restripe operation. It probably also reduces the
amount of code necessary.
-hpa
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-20 18:56 ` Andrea Mazzoleni
2013-11-20 18:59 ` H. Peter Anvin
@ 2013-11-20 19:00 ` H. Peter Anvin
2013-11-20 21:04 ` Andrea Mazzoleni
1 sibling, 1 reply; 104+ messages in thread
From: H. Peter Anvin @ 2013-11-20 19:00 UTC (permalink / raw)
To: Andrea Mazzoleni
Cc: David Brown, Linux RAID Mailing List, Btrfs BTRFS, David Smith
On 11/20/2013 10:56 AM, Andrea Mazzoleni wrote:
> Hi,
>
> Yep. At present to multiply for 2^-1 I'm using in C:
>
> static inline uint64_t d2_64(uint64_t v)
> {
> uint64_t mask = v & 0x0101010101010101U;
> mask = (mask << 8) - mask;
(mask << 7) I assume...
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-20 16:16 ` James Plank
@ 2013-11-20 19:05 ` Andrea Mazzoleni
2013-11-20 19:10 ` H. Peter Anvin
2013-11-21 1:28 ` Stan Hoeppner
1 sibling, 1 reply; 104+ messages in thread
From: Andrea Mazzoleni @ 2013-11-20 19:05 UTC (permalink / raw)
To: James Plank
Cc: Ric Wheeler, H. Peter Anvin, Linux RAID Mailing List, Btrfs BTRFS,
David Brown, David Smith
Hi Jim,
I build the matrix in a way that results in coefficients matching
Linux RAID for the first two rows, and at the same time guarantees
that all the square submatrices are non-singular, resulting in an
MDS code.
I start by forming a Cauchy matrix, setting each element to 1/(xi+yj),
where all the xi and yj are distinct elements. This is how a Cauchy
matrix is usually defined in textbooks.
For the first row with j=0, I use xi = 2^-i and y0 = 0, that results in:
row j=0 -> 1/(xi+y0) = 1/(2^-i + 0) = 2^i (RAID-6 coefficients)
For the next rows with j>0, I use yj = 2^j, resulting in:
rows j>0 -> 1/(xi+yj) = 1/(2^-i + 2^j)
with xi != yj for any i,j with i>=0,j>=1,i+j<255
Then I put a row filled with 1 at the top of the Cauchy matrix,
transforming it into an Extended Cauchy matrix.
This transformation maintains the property that all the
square submatrices are non-singular.
I found this property mentioned in some papers/textbooks, like
in the introduction of:
Vinocha, On Generator Cauchy Matrices of GDRS/GTRS Codes, 2012
http://www.m-hikari.com/ijcms/ijcms-2012/45-48-2012/brarIJCMS45-48-2012.pdf
Finally I adjust all the rows to have the first column filled with 1,
by multiplying each row by an adjusting factor.
This transformation also maintains the property that all the
square submatrices are non-singular, so we end up with an MDS code.
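To make the construction easier to follow, here is a rough standalone
sketch in C (same 0x11d polynomial as the Linux RAID-6 code; the real
implementation is the set_cauchy() function in mktables.c mentioned
earlier in the thread, so treat this only as an illustration):

#include <stdint.h>

#define GF_POLY 0x11d /* same polynomial as the Linux RAID-6 code */

static uint8_t gfexp[256], gflog[256];

static void gf_init(void)
{
    int i, v = 1;

    for (i = 0; i < 255; i++) {
        gfexp[i] = v;
        gflog[v] = i;
        v <<= 1;
        if (v & 0x100)
            v ^= GF_POLY;
    }
}

static uint8_t gfmul(uint8_t a, uint8_t b)
{
    if (!a || !b)
        return 0;
    return gfexp[(gflog[a] + gflog[b]) % 255];
}

static uint8_t gfinv(uint8_t a) /* a must be non-zero */
{
    return gfexp[(255 - gflog[a]) % 255];
}

/*
 * Fill the nparity x ndata matrix: row 0 is all 1 (RAID-5), row 1 is 2^i
 * (RAID-6), and the following rows are 1/(2^-i + 2^j), normalized so that
 * the first column is all 1. Requires ndata <= 257 - nparity.
 */
static void cauchy_sketch(uint8_t m[][256], int nparity, int ndata)
{
    int i, j;
    uint8_t xi, yj, f;

    for (i = 0; i < ndata; i++) {
        m[0][i] = 1;        /* extended row of ones */
        m[1][i] = gfexp[i]; /* Cauchy row j=0: 1/(2^-i + 0) = 2^i */
    }
    for (j = 2; j < nparity; j++) {
        yj = gfexp[j - 1];  /* y = 2^(j-1) for this row */
        for (i = 0; i < ndata; i++) {
            xi = gfinv(gfexp[i]);     /* x = 2^-i, always distinct from y */
            m[j][i] = gfinv(xi ^ yj); /* 1/(x + y) */
        }
        f = gfinv(m[j][0]); /* adjust the row so the first column becomes 1 */
        for (i = 0; i < ndata; i++)
            m[j][i] = gfmul(m[j][i], f);
    }
}

gf_init() has to run once before cauchy_sketch(); calling it with
nparity = 6 and ndata = 251 builds the full six-parity table.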
Ciao,
Andrea
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-20 19:05 ` Andrea Mazzoleni
@ 2013-11-20 19:10 ` H. Peter Anvin
2013-11-20 20:30 ` James Plank
0 siblings, 1 reply; 104+ messages in thread
From: H. Peter Anvin @ 2013-11-20 19:10 UTC (permalink / raw)
To: Andrea Mazzoleni, James Plank
Cc: Ric Wheeler, Linux RAID Mailing List, Btrfs BTRFS, David Brown,
David Smith
On 11/20/2013 11:05 AM, Andrea Mazzoleni wrote:
>
> For the first row with j=0, I use xi = 2^-i and y0 = 0, that results in:
>
How can xi = 2^-i if x is supposed to be constant?
That doesn't mean that your approach isn't valid, of course, but it
might not be a Cauchy matrix and thus needs additional analysis.
> row j=0 -> 1/(xi+y0) = 1/(2^-i + 0) = 2^i (RAID-6 coefficients)
>
> For the next rows with j>0, I use yj = 2^j, resulting in:
>
> rows j>0 -> 1/(xi+yj) = 1/(2^-i + 2^j)
Even more so here... 2^-i and 2^j don't seem to be of the form xi and yj
respectively.
-hpa
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-20 19:10 ` H. Peter Anvin
@ 2013-11-20 20:30 ` James Plank
2013-11-20 21:23 ` Andrea Mazzoleni
2013-11-20 21:28 ` H. Peter Anvin
0 siblings, 2 replies; 104+ messages in thread
From: James Plank @ 2013-11-20 20:30 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Andrea Mazzoleni, Ric Wheeler, Linux RAID Mailing List,
Btrfs BTRFS, David Brown, David Smith
Peter, I think I understand it differently. Concrete example in GF(256) for k=6, m=4:
First, create a 3 by 6 Cauchy matrix, using x_i = 2^-i, and y_i = 0 for i=0, and y_i = 2^i for other i. In this case: x = { 1, 142, 71, 173, 216, 108 }, y = { 0, 2, 4 }. The Cauchy matrix is:
  1   2   4   8  16  32
244  83  78 183 118  47
167  39 213  59 153  82
Divide row 2 by 244 and row 3 by 167. Then extend it with a row of ones on top and it's still MDS, and that's the code for m=4, with RAID-6 as a subset. Very nice!
Jim
----------
On Nov 20, 2013, at 2:10 PM, H. Peter Anvin wrote:
> On 11/20/2013 11:05 AM, Andrea Mazzoleni wrote:
>>
>> For the first row with j=0, I use xi = 2^-i and y0 = 0, that results in:
>>
>
> How can xi = 2^-i if x is supposed to be constant?
>
> That doesn't mean that your approach isn't valid, of course, but it
> might not be a Cauchy matrix and thus needs additional analysis.
>
>> row j=0 -> 1/(xi+y0) = 1/(2^-i + 0) = 2^i (RAID-6 coefficients)
>>
>> For the next rows with j>0, I use yj = 2^j, resulting in:
>>
>> rows j>0 -> 1/(xi+yj) = 1/(2^-i + 2^j)
>
> Even more so here... 2^-i and 2^j don't seem to be of the form xi and yj
> respectively.
>
> -hpa
>
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-20 19:00 ` H. Peter Anvin
@ 2013-11-20 21:04 ` Andrea Mazzoleni
2013-11-20 21:06 ` H. Peter Anvin
0 siblings, 1 reply; 104+ messages in thread
From: Andrea Mazzoleni @ 2013-11-20 21:04 UTC (permalink / raw)
To: H. Peter Anvin
Cc: David Brown, Linux RAID Mailing List, Btrfs BTRFS, David Smith
Hi Peter,
>> static inline uint64_t d2_64(uint64_t v)
>> {
>> uint64_t mask = v & 0x0101010101010101U;
>> mask = (mask << 8) - mask;
>
> (mask << 7) I assume...
No. It's "(mask << 8) - mask". We want to expand the bit at position 0
(in each byte) to the full byte, resulting in 0xFF if the bit is 1,
and 0x00 if the bit is 0.
(0 << 8) - 0 = 0x00
(1 << 8) - 1 = 0x100 - 1 = 0xFF
Ciao,
Andrea
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-20 21:04 ` Andrea Mazzoleni
@ 2013-11-20 21:06 ` H. Peter Anvin
0 siblings, 0 replies; 104+ messages in thread
From: H. Peter Anvin @ 2013-11-20 21:06 UTC (permalink / raw)
To: Andrea Mazzoleni
Cc: David Brown, Linux RAID Mailing List, Btrfs BTRFS, David Smith
On 11/20/2013 01:04 PM, Andrea Mazzoleni wrote:
> Hi Peter,
>
>>> static inline uint64_t d2_64(uint64_t v)
>>> {
>>> uint64_t mask = v & 0x0101010101010101U;
>>> mask = (mask << 8) - mask;
>>
>> (mask << 7) I assume...
> No. It's "(mask << 8) - mask". We want to expand the bit at position 0
> (in each byte) to the full byte, resulting in 0xFF if the bit is at 1,
> and 0x00 if the bit is 0.
>
> (0 << 8) - 0 = 0x00
> (1 << 8) - 1 = 0x100 - 1 = 0xFF
>
Oh, right... it is the same as (v << 1) - (v >> 7) except everything is
shifted over one.
-hpa
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-20 18:59 ` H. Peter Anvin
@ 2013-11-20 21:21 ` Andrea Mazzoleni
0 siblings, 0 replies; 104+ messages in thread
From: Andrea Mazzoleni @ 2013-11-20 21:21 UTC (permalink / raw)
To: H. Peter Anvin
Cc: David Brown, Linux RAID Mailing List, Btrfs BTRFS, David Smith
Hi Peter,
> Now, that doesn't sound like something that can get neatly meshed into
> the Cauchy matrix scheme, I assume.
You are correct. Multiplication by 2^-1 cannot be used for the Cauchy method.
I used it to implement an alternate triple parity not requiring PSHUFB
that I used as reference for performance evaluation of the Cauchy way,
assuming that this implementation using powers of 1,2,2^-1 is the
fastest possible one.
Hopefully the difference is minimal, and the Cauchy method is
competitive even at triple parity.
Ciao,
Andrea
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-20 20:30 ` James Plank
@ 2013-11-20 21:23 ` Andrea Mazzoleni
2013-11-27 2:50 ` ronnie sahlberg
2013-11-20 21:28 ` H. Peter Anvin
1 sibling, 1 reply; 104+ messages in thread
From: Andrea Mazzoleni @ 2013-11-20 21:23 UTC (permalink / raw)
To: James Plank
Cc: H. Peter Anvin, Ric Wheeler, Linux RAID Mailing List, Btrfs BTRFS,
David Brown, David Smith
Hi,
> First, create a 3 by 6 cauchy matrix, using x_i = 2^-i, and y_i = 0 for i=0, and y_i = 2^i for other i.
> In this case: x = { 1, 142, 71, 173, 216, 108 } y = { 0, 2, 4). The cauchy matrix is:
>
> 1 2 4 8 16 32
> 244 83 78 183 118 47
> 167 39 213 59 153 82
>
> Divide row 2 by 244 and row 3 by 167. Then extend it with a row of ones on top and it's still MDS,
> and that's the code for m=4, with RAID-6 as a subset. Very nice!
You got it Jim!
Thanks,
Andrea
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-20 20:30 ` James Plank
2013-11-20 21:23 ` Andrea Mazzoleni
@ 2013-11-20 21:28 ` H. Peter Anvin
1 sibling, 0 replies; 104+ messages in thread
From: H. Peter Anvin @ 2013-11-20 21:28 UTC (permalink / raw)
To: James Plank
Cc: Andrea Mazzoleni, Ric Wheeler, Linux RAID Mailing List,
Btrfs BTRFS, David Brown, David Smith
On 11/20/2013 12:30 PM, James Plank wrote:
> Peter, I think I understand it differently. Concrete example in GF(256) for k=6, m=4:
>
> First, create a 3 by 6 cauchy matrix, using x_i = 2^-i, and y_i = 0 for i=0, and y_i = 2^i for other i. In this case: x = { 1, 142, 71, 173, 216, 108 } y = { 0, 2, 4). The cauchy matrix is:
Sorry, I took xi and yj to mean a constant x multiplied with i and a
constant y multiplied with j, rather than x_i and y_j.
-hpa
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-19 18:12 ` Piergiorgio Sartor
2013-11-20 10:44 ` David Brown
@ 2013-11-20 21:38 ` Andrea Mazzoleni
1 sibling, 0 replies; 104+ messages in thread
From: Andrea Mazzoleni @ 2013-11-20 21:38 UTC (permalink / raw)
To: Piergiorgio Sartor
Cc: Linux RAID Mailing List, Btrfs BTRFS, H. Peter Anvin, David Brown,
David Smith
Hi Piergiorgio,
> In RAID-6 (as per raid6check) there is an easy way
> to verify where an HDD has incorrect data.
> I suspect, for each 2 parity block it should be
> possible to find 1 error (and if this is true, then
> quad parity is more attractive than triple one).
Yes. The theory says that with quad parity it is possible to find 2
errors, or to find 1 error when running in degraded mode with 2 missing
disks.
But yep, the problem is how to do it in a fast way. I don't know the
field well enough to answer this.
Ciao,
Andrea
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-20 10:44 ` David Brown
@ 2013-11-20 21:59 ` Piergiorgio Sartor
2013-11-21 10:13 ` David Brown
0 siblings, 1 reply; 104+ messages in thread
From: Piergiorgio Sartor @ 2013-11-20 21:59 UTC (permalink / raw)
To: David Brown
Cc: Piergiorgio Sartor, Andrea Mazzoleni, linux-raid, linux-btrfs,
hpa, creamyfish
On Wed, Nov 20, 2013 at 11:44:39AM +0100, David Brown wrote:
[...]
> > In RAID-6 (as per raid6check) there is an easy way
> > to verify where an HDD has incorrect data.
> >
>
> I think the way to do that is just to generate the parity blocks from
> the data blocks, and compare them to the existing parity blocks.
Uhm, the generic RS decoder should try all
the possible combinations of erasures and so
detect the error.
This is already unfeasible with 3 parities,
so there are faster algorithms, I believe:
Peterson–Gorenstein–Zierler algorithm
Berlekamp–Massey algorithm
Nevertheless, I do not know too much about
those, so I cannot state if they apply to
the Cauchy matrix as explained here.
bye,
--
piergiorgio
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-18 22:08 Andrea Mazzoleni
2013-11-18 22:12 ` H. Peter Anvin
2013-11-19 18:12 ` Piergiorgio Sartor
@ 2013-11-20 22:29 ` Piergiorgio Sartor
2013-11-23 7:55 ` Andrea Mazzoleni
2 siblings, 1 reply; 104+ messages in thread
From: Piergiorgio Sartor @ 2013-11-20 22:29 UTC (permalink / raw)
To: Andrea Mazzoleni; +Cc: linux-raid, linux-btrfs, hpa, david.brown, creamyfish
On Mon, Nov 18, 2013 at 11:08:59PM +0100, Andrea Mazzoleni wrote:
[...]
I've a side question, a bit OT, but maybe you
could help with the answer.
How about par2? How does this work?
They claim "Vendermonde" matrix and they seem
to be quite flexible in amount of parities.
The could be in GF(2^16), so maybe this makes
a difference...
bye,
--
piergiorgio
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-20 16:16 ` James Plank
2013-11-20 19:05 ` Andrea Mazzoleni
@ 2013-11-21 1:28 ` Stan Hoeppner
2013-11-21 2:46 ` John Williams
` (3 more replies)
1 sibling, 4 replies; 104+ messages in thread
From: Stan Hoeppner @ 2013-11-21 1:28 UTC (permalink / raw)
To: James Plank, Ric Wheeler
Cc: Andrea Mazzoleni, H. Peter Anvin, linux-raid, linux-btrfs,
David Brown, David Smith
On 11/20/2013 10:16 AM, James Plank wrote:
> Hi all -- no real comments, except as I mentioned to Ric, my tutorial
> in FAST last February presents Reed-Solomon coding with Cauchy
> matrices, and then makes special note of the common pitfall of
> assuming that you can append a Vandermonde matrix to an identity
> matrix. Please see
> http://web.eecs.utk.edu/~plank/plank/papers/2013-02-11-FAST-Tutorial.pdf,
> slides 48-52.
>
> Andrea, does the matrix that you included in an earlier mail (the one
> that has Linux RAID-6 in the first two rows) have a general form, or
> did you develop it in an ad hoc manner so that it would include Linux
> RAID-6 in the first two rows?
Hello Jim,
It's always perilous to follow a Ph.D., so I guess I'm feeling suicidal
today. ;)
I'm not attempting to marginalize Andrea's work here, but I can't help
but ponder what the real value of triple parity RAID is, or quad, or
beyond. Some time ago parity RAID's primary mission ceased to be
surviving single drive failure, or a 2nd failure during rebuild, and
became mitigating UREs during a drive rebuild. So we're now talking
about dedicating 3 drives of capacity to avoiding disaster due to
platter defects and secondary drive failure. For small arrays this is
approaching half the array capacity. So here parity RAID has lost the
battle with RAID10's capacity disadvantage, yet it still suffers the
vastly inferior performance in normal read/write IO, not to mention
rebuild times that are 3-10x longer.
WRT rebuild times, once drives hit 20TB we're looking at 18 hours just
to mirror a drive at full streaming bandwidth, assuming 300MB/s
average--and that is probably being kind to the drive makers. With 6 or
8 of these drives, I'd guess a typical md/RAID6 rebuild will take at
minimum 72 hours or more, probably over 100, and probably more yet for
3P. And with larger drive count arrays the rebuild times approach a
week. Whose users can go a week with degraded performance? This is
simply unreasonable, at best. I say it's completely unacceptable.
With these gargantuan drives coming soon, the probability of multiple
UREs during rebuild are pretty high. Continuing to use ever more
complex parity RAID schemes simply increases rebuild time further. The
longer the rebuild, the more likely a subsequent drive failure due to
heat buildup, vibration, etc. Thus, in our maniacal efforts to mitigate
one failure mode we're increasing the probability of another. TANSTAFL.
Worse yet, RAID10 isn't going to survive because UREs on a single drive
are increasingly likely with these larger drives, and one URE during
rebuild destroys the array.
I think people are going to have to come to grips with using more and
more drives simply to brace the legs holding up their arrays; come to
grips with these insane rebuild times; or bite the bullet they so
steadfastly avoided with RAID10. Lots more spindles solves problems,
but at a greater cost--again, no free lunch.
What I envision is an array type, something similar to RAID 51, i.e.
striped parity over mirror pairs. In the case of Linux, this would need
to be a new distinct md/RAID level, as both the RAID5 and RAID1 code
would need enhancement before being meshed together into this new level[1].
Potential Advantages:
1. Only +1 disk capacity overhead vs RAID 10, regardless of drive count
2. Rebuild time is the same as RAID 10, unless a mirror pair is lost
3. Parity is only used during rebuild if/when a URE occurs, unless ^
4. Single drive failure doesn't degrade the parity array, multiple
failures in different mirrors doesn't degrade the parity array
5. Can sustain a minimum of 3 simultaneous drive failures--both drives
in one mirror and one drive in another mirror
6. Can lose a maximum of 1/2 of the drives plus 1 drive--one more than
RAID 10. Can lose half the drives and still not degrade parity,
if no two comprise one mirror
7. Similar or possibly better read throughput vs triple parity RAID
8. Superior write performance with drives down
9. Vastly superior rebuild performance, as rebuilds will rarely, if
ever, involve parity
Potential Disadvantages:
1. +1 disk overhead vs RAID 10, many more than 2/3P w/large arrays
2. Read-modify-write penalty vs RAID 10
3. Slower write throughput vs triple parity RAID due to spindle deficit
4. Development effort
5. ??
[1] The RAID1/5 code would need to be patched to properly handle a URE
encountered by the RAID1 code during rebuild. There are surely other
modifications and/or optimizations that would be needed. For large
sequential reads, more deterministic read interleaving between mirror
pairs would be a good candidate I think. IIUC the RAID1 driver does
read interleaving on a per thread basis or some such, which I don't
believe is going to work for this "RAID 51" scenario, at least not for
single streaming reads. If this can be done well, we double the read
performance of RAID5, and thus we don't completely "waste" all the extra
disks vs big_parity schemes.
This proposed "RAID level 51" should have drastically lower rebuild
times vs traditional striped parity, should not suffer read/write
performance degradation with most disk failure scenarios, and with a
read interleaving optimization may have significantly greater streaming
read throughput as well.
This is far from a perfect solution and I am certainly not promoting it
as such. But I think it does have some serious advantages over
traditional striped parity schemes, and at minimum is worth discussion
as a counterpoint of sorts.
--
Stan
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-21 1:28 ` Stan Hoeppner
@ 2013-11-21 2:46 ` John Williams
2013-11-21 6:52 ` Stan Hoeppner
2013-11-21 8:08 ` joystick
` (2 subsequent siblings)
3 siblings, 1 reply; 104+ messages in thread
From: John Williams @ 2013-11-21 2:46 UTC (permalink / raw)
To: stan
Cc: James Plank, Ric Wheeler, Andrea Mazzoleni, H. Peter Anvin,
Linux RAID Mailing List, Btrfs BTRFS, David Brown, David Smith
For myself or any machines I managed for work that do not need high
IOPS, I would definitely choose triple- or quad-parity over RAID 51 or
similar schemes with arrays of 16 - 32 drives.
No need to go into detail here on a subject Adam Leventhal has already
covered in detail in an article "Triple-Parity RAID and Beyond" which
seems to match the subject of this thread quite nicely:
http://queue.acm.org/detail.cfm?id=1670144
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-20 18:44 ` Andrea Mazzoleni
@ 2013-11-21 6:15 ` Stan Hoeppner
0 siblings, 0 replies; 104+ messages in thread
From: Stan Hoeppner @ 2013-11-21 6:15 UTC (permalink / raw)
To: Andrea Mazzoleni, John Williams
Cc: David Brown, H. Peter Anvin, Linux RAID Mailing List, Btrfs BTRFS,
David Smith
On 11/20/2013 12:44 PM, Andrea Mazzoleni wrote:
> Yes. There are still AMD CPUs sold without SSSE3. Most notably Athlon.
> Instead, Intel is providing SSSE3 from the Core 2 Duo.
I hate branding discontinuity, due to the resulting confusion...
Athlon, Athlon64, Athlon64 X2, Athlon X2 (K10), Athlon II X2, Athlon X2
(Piledriver). Anyone confused?
The Trinity and Richland core "Athlon X2" and "Athlon X4" branded
processors certainly do support SSSE3, as well as SSE4, AVX, etc. These
are the dual/quad core APUs whose graphics cores don't pass QC and are
surgically disabled. AMD decided to brand them as "Athlon" processors.
Available since ~2011. For example:
http://www.cpu-world.com/CPUs/Bulldozer/AMD-Athlon%20X2%20370K%20-%20AD370KOKA23HL%20-%20AD370KOKHLBOX.html
The "Athlon II X2/X3/X4" processors have been out of production for a
couple of years now, but a scant few might still be found for sale in
the channel. The X2 is based on the clean sheet Regor dual core 45nm
design. The X3 and X4 are Phenom II rejects with various numbers of
defective cores and defective L3 caches. None support SSSE3.
To say "there are still AMD CPUs sold without SSSE3... Most notably
Athlon" may be technically true if some Athlon II stragglers exist in
the channel. But it isn't really a fair statement of today's reality.
AMD hasn't manufactured a CPU without SSSE3 for a couple of years now.
And few, if any, Athlon II X2/3/4 chips lacking SSSE3 are for sale.
Though there are certainly many such chips still in deployed desktop
machines.
> A detailed list is available at: http://en.wikipedia.org/wiki/SSSE3
Never trust Wikipedia articles to be complete and up to date. However,
it does mention Athlon X2 and X4 as planned future product in the
Piledriver lineup. Obviously this should be updated to past tense.
--
Stan
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-21 2:46 ` John Williams
@ 2013-11-21 6:52 ` Stan Hoeppner
2013-11-21 7:05 ` John Williams
0 siblings, 1 reply; 104+ messages in thread
From: Stan Hoeppner @ 2013-11-21 6:52 UTC (permalink / raw)
To: John Williams
Cc: James Plank, Ric Wheeler, Andrea Mazzoleni, H. Peter Anvin,
Linux RAID Mailing List, Btrfs BTRFS, David Brown, David Smith
On 11/20/2013 8:46 PM, John Williams wrote:
> For myself or any machines I managed for work that do not need high
> IOPS, I would definitely choose triple- or quad-parity over RAID 51 or
> similar schemes with arrays of 16 - 32 drives.
You must see a week long rebuild as acceptable...
> No need to go into detail here
I disagree.
> on a subject Adam Leventhal has already
> covered in detail in an article "Triple-Parity RAID and Beyond" which
> seems to match the subject of this thread quite nicely:
>
> http://queue.acm.org/detail.cfm?id=1670144
Mr. Leventhal did not address the overwhelming problem we face, which is
(multiple) parity array reconstruction time. He assumes the time to
simply 'populate' one drive at its max throughput is the total
reconstruction time for the array. While this is typically true for
mirror based arrays, it is clearly not for parity arrays.
The primary focus of my comments was reducing rebuild time, thus
increasing overall reliability. RAID 51 or something similar would
achieve this. Thus I think we should discuss alternatives to multiple
parity in detail.
--
Stan
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-21 6:52 ` Stan Hoeppner
@ 2013-11-21 7:05 ` John Williams
2013-11-21 22:57 ` Stan Hoeppner
0 siblings, 1 reply; 104+ messages in thread
From: John Williams @ 2013-11-21 7:05 UTC (permalink / raw)
To: stan
Cc: James Plank, Ric Wheeler, Andrea Mazzoleni, H. Peter Anvin,
Linux RAID Mailing List, Btrfs BTRFS, David Brown, David Smith
On Wed, Nov 20, 2013 at 10:52 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> On 11/20/2013 8:46 PM, John Williams wrote:
>> For myself or any machines I managed for work that do not need high
>> IOPS, I would definitely choose triple- or quad-parity over RAID 51 or
>> similar schemes with arrays of 16 - 32 drives.
>
> You must see a week long rebuild as acceptable...
It would not be a problem if it did take that long, since I would have
extra parity units as backup in case of a failure during a rebuild.
But of course it would not take that long. Take, for example, a 24 x
3TB triple-parity array (21+3) that has had two drive failures
(perhaps the rebuild started with one failure, but there was soon
another failure). I would expect the rebuild to take about a day.
>> on a subject Adam Leventhal has already
>> covered in detail in an article "Triple-Parity RAID and Beyond" which
>> seems to match the subject of this thread quite nicely:
>>
>> http://queue.acm.org/detail.cfm?id=1670144
>
> Mr. Leventhal did not address the overwhelming problem we face, which is
> (multiple) parity array reconstruction time. He assumes the time to
> simply 'populate' one drive at its max throughput is the total
> reconstruction time for the array.
Since Adam wrote the code for RAID-Z3 for ZFS, I'm sure he is aware of
the time to restore data to failed drives. I do not see any flaw in
his analysis related to the time needed to restore data to failed
drives.
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-21 1:28 ` Stan Hoeppner
2013-11-21 2:46 ` John Williams
@ 2013-11-21 8:08 ` joystick
2013-11-22 0:30 ` Stan Hoeppner
2013-11-21 9:07 ` David Brown
2013-11-21 19:56 ` Piergiorgio Sartor
3 siblings, 1 reply; 104+ messages in thread
From: joystick @ 2013-11-21 8:08 UTC (permalink / raw)
To: stan
Cc: James Plank, Ric Wheeler, Andrea Mazzoleni, H. Peter Anvin,
linux-raid, linux-btrfs, David Brown, David Smith
On 21/11/2013 02:28, Stan Hoeppner wrote:
> On 11/20/2013 10:16 AM, James Plank wrote:
>> Hi all -- no real comments, except as I mentioned to Ric, my tutorial
>> in FAST last February presents Reed-Solomon coding with Cauchy
>> matrices, and then makes special note of the common pitfall of
>> assuming that you can append a Vandermonde matrix to an identity
>> matrix. Please see
>> http://web.eecs.utk.edu/~plank/plank/papers/2013-02-11-FAST-Tutorial.pdf,
>> slides 48-52.
>>
>> Andrea, does the matrix that you included in an earlier mail (the one
>> that has Linux RAID-6 in the first two rows) have a general form, or
>> did you develop it in an ad hoc manner so that it would include Linux
>> RAID-6 in the first two rows?
> Hello Jim,
>
> It's always perilous to follow a Ph.D., so I guess I'm feeling suicidal
> today. ;)
>
> I'm not attempting to marginalize Andrea's work here, but I can't help
> but ponder what the real value of triple parity RAID is, or quad, or
> beyond. Some time ago parity RAID's primary mission ceased to be
> surviving single drive failure, or a 2nd failure during rebuild, and
> became mitigating UREs during a drive rebuild. So we're now talking
> about dedicating 3 drives of capacity to avoiding disaster due to
> platter defects and secondary drive failure. For small arrays this is
> approaching half the array capacity. So here parity RAID has lost the
> battle with RAID10's capacity disadvantage, yet it still suffers the
> vastly inferior performance in normal read/write IO, not to mention
> rebuild times that are 3-10x longer.
>
> WRT rebuild times, once drives hit 20TB we're looking at 18 hours just
> to mirror a drive at full streaming bandwidth, assuming 300MB/s
> average--and that is probably being kind to the drive makers. With 6 or
> 8 of these drives, I'd guess a typical md/RAID6 rebuild will take at
> minimum 72 hours or more, probably over 100, and probably more yet for
> 3P. And with larger drive count arrays the rebuild times approach a
> week. Whose users can go a week with degraded performance? This is
> simply unreasonable, at best. I say it's completely unacceptable.
>
> With these gargantuan drives coming soon, the probability of multiple
> UREs during rebuild are pretty high.
No, because if you are correct about the very high CPU overhead during
rebuild (which I don't see as so dramatic, given that Andrea claims
500MB/sec for triple-parity, probably parallelizable across multiple
cores), the speed of the rebuild decreases proportionally, and hence the
stress and heating on the drives are proportionally reduced,
approximating those of normal operation. And how often have you seen a
drive failure in a week of normal operation?
But in reality, consider that a non-naive implementation of
multiple-parity would probably use just the single parity during
reconstruction if just one disk fails, using the multiple parities only
to read the stripes which are unreadable at single parity. So the speed
and time of reconstruction and performance penalty would be that of
raid5 except in exceptional situations of multiple failures.
> ...
> What I envision is an array type, something similar to RAID 51, i.e.
> striped parity over mirror pairs. ....
I don't like your approach of raid 51: it has the write overhead of
raid5, with the waste of space of raid1.
So it can be used neither as a performance array nor as a capacity array.
In the scope of this discussion (we are talking about very large
arrays), the waste of space of your solution, higher than 50%, will make
your solution cost double the price.
A competitor for the multiple-parity scheme might be raid65 or 66, but
that is a much dirtier approach than multiple parity if you think about
the kind of RMW and overhead that will occur during normal operation.
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-20 18:09 ` John Williams
2013-11-20 18:44 ` Andrea Mazzoleni
@ 2013-11-21 8:32 ` David Brown
1 sibling, 0 replies; 104+ messages in thread
From: David Brown @ 2013-11-21 8:32 UTC (permalink / raw)
To: John Williams
Cc: Andrea Mazzoleni, H. Peter Anvin, Linux RAID Mailing List,
Btrfs BTRFS, David Smith
On 20/11/13 19:09, John Williams wrote:
> On Wed, Nov 20, 2013 at 2:31 AM, David Brown <david.brown@hesbynett.no> wrote:
>> That's certainly a reasonable way to look at it. We should not limit
>> the possibilities for high-end systems because of the limitations of
>> low-end systems that are unlikely to use 3+ parity anyway. I've also
>> looked up a list of the processors that support SSE3 and PSHUFB - a lot
>> of modern "low-end" x86 cpus support it. And of course it is possible
>> to implement general G(2^8) multiplication without PSHUFB, using a
>> lookup table - it is important that this can all work with any CPU, even
>> if it is slow.
>
> Unfortunately, it is SSSE3 that is required for PSHUFB. The SSE3 set
> with only two-esses does not suffice. I made that same mistake when I
> first heard about Andrea's 6-parity work. SSSE3 vs. SSE3, confusing
> notation!
>
> SSSE3 is significantly less widely supported than SSE3. Particularly
> on AMD, only the very latest CPUs seem to support SSSE3. Intel support
> for SSSE3 goes back much further than AMD support.
>
> Maybe it is not such a big problem, since it may be possible to
> support two "roads". Both roads would include the current md RAID-5
> and RAID-6. But one road, which those lacking CPUs supporting SSSE3
> might choose, would continue on to the non-SSSE3 triple-parity 2^-1
> technique, and then dead-end. The other road would continue with the
> Cauchy matrix technique through 3-parity all the way to 6-parity.
>
> It might even be feasible to allow someone stuck at the end of the
> non-SSSE3 road to convert to the Cauchy road. You would have to go
> through all the 2^-1 triple-parity and convert it to Cauchy
> triple-parity. But then you would be safely on the Cauchy road.
>
I would not like to see two alternative triple-parity solutions - I
think that would lead to confusion, and a non-Cauchy triple parity would
not be extendible without a rebuild (I've talked before about the idea
of temporarily adding an extra parity drive with an asymmetric layout.
I really like the idea, so I keep pushing for it!).
I think it is better to accept that 3+ parity will be slow on processors
that don't support PSHUFB. We should try to find the best alternative
SIMD for other realistic processors (such as on AMD chips without
PSHUFB, ARMs with NEON, PPC with Altivec, etc.) - but a simple table
lookup will always work as a fallback. Other than that I think it is
fair to say that if you want /fast/ 3+ parity, you need a reasonably
modern non-budget-class cpu.
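For what it's worth, the fallback really is simple - a full 64 KiB
GF(2^8) multiplication table covers any CPU. A rough sketch (0x11d
polynomial; nothing like the tuned kernel code, just to show the idea):

#include <stdint.h>

static uint8_t gfmul_table[256][256];

/* build the full multiplication table once, usable on any CPU */
static void gfmul_table_init(void)
{
    int a, b;

    for (a = 0; a < 256; a++) {
        for (b = 0; b < 256; b++) {
            uint8_t x = a, y = b, r = 0;

            while (y) {
                if (y & 1)
                    r ^= x;
                x = (x << 1) ^ ((x & 0x80) ? 0x1d : 0x00);
                y >>= 1;
            }
            gfmul_table[a][b] = r;
        }
    }
}

A lookup in gfmul_table[][] can then stand in for PSHUFB wherever SSSE3
(or an equivalent byte shuffle) is missing - slow, but it always works.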
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-20 18:34 ` Andrea Mazzoleni
2013-11-20 18:43 ` H. Peter Anvin
@ 2013-11-21 8:36 ` David Brown
1 sibling, 0 replies; 104+ messages in thread
From: David Brown @ 2013-11-21 8:36 UTC (permalink / raw)
To: Andrea Mazzoleni; +Cc: H. Peter Anvin, linux-raid, linux-btrfs, David Smith
On 20/11/13 19:34, Andrea Mazzoleni wrote:
> Hi David,
>
>>> The choice of ZFS to use powers of 4 was likely not optimal,
>>> because to multiply by 4, it has to do two multiplications by 2.
>> I can agree with that. I didn't copy ZFS's choice here
> David, it was not my intention to suggest that you copied from ZFS.
> Sorry to have expressed myself badly. I just mentioned ZFS because it's
> an implementation that I know uses powers of 4 to generate triple
> parity, and I saw in the code that it's implemented with two multiplication
> by 2.
>
Andrea, I didn't take your comment as an accusation of any kind - there
is no need for any kind of apology! It was merely a statement of
fact - I picked powers of 4 as an obvious extension of the powers of 2
in raid6, and found it worked well.
And of course, in the open source world, copying of code and ideas is a
good thing - there is no point in re-inventing the wheel unless we can
invent a better one. Really, I /should/ have read the ZFS
implementation and copied it!
mvh.,
David
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-21 1:28 ` Stan Hoeppner
2013-11-21 2:46 ` John Williams
2013-11-21 8:08 ` joystick
@ 2013-11-21 9:07 ` David Brown
2013-11-21 9:54 ` Adam Goryachev
` (2 more replies)
2013-11-21 19:56 ` Piergiorgio Sartor
3 siblings, 3 replies; 104+ messages in thread
From: David Brown @ 2013-11-21 9:07 UTC (permalink / raw)
To: stan, James Plank, Ric Wheeler
Cc: Andrea Mazzoleni, H. Peter Anvin, linux-raid, linux-btrfs,
David Smith
On 21/11/13 02:28, Stan Hoeppner wrote:
> On 11/20/2013 10:16 AM, James Plank wrote:
>> Hi all -- no real comments, except as I mentioned to Ric, my tutorial
>> in FAST last February presents Reed-Solomon coding with Cauchy
>> matrices, and then makes special note of the common pitfall of
>> assuming that you can append a Vandermonde matrix to an identity
>> matrix. Please see
>> http://web.eecs.utk.edu/~plank/plank/papers/2013-02-11-FAST-Tutorial.pdf,
>> slides 48-52.
>>
>> Andrea, does the matrix that you included in an earlier mail (the one
>> that has Linux RAID-6 in the first two rows) have a general form, or
>> did you develop it in an ad hoc manner so that it would include Linux
>> RAID-6 in the first two rows?
>
> Hello Jim,
>
> It's always perilous to follow a Ph.D., so I guess I'm feeling suicidal
> today. ;)
>
> I'm not attempting to marginalize Andrea's work here, but I can't help
> but ponder what the real value of triple parity RAID is, or quad, or
> beyond. Some time ago parity RAID's primary mission ceased to be
> surviving single drive failure, or a 2nd failure during rebuild, and
> became mitigating UREs during a drive rebuild. So we're now talking
> about dedicating 3 drives of capacity to avoiding disaster due to
> platter defects and secondary drive failure. For small arrays this is
> approaching half the array capacity. So here parity RAID has lost the
> battle with RAID10's capacity disadvantage, yet it still suffers the
> vastly inferior performance in normal read/write IO, not to mention
> rebuild times that are 3-10x longer.
>
> WRT rebuild times, once drives hit 20TB we're looking at 18 hours just
> to mirror a drive at full streaming bandwidth, assuming 300MB/s
> average--and that is probably being kind to the drive makers. With 6 or
> 8 of these drives, I'd guess a typical md/RAID6 rebuild will take at
> minimum 72 hours or more, probably over 100, and probably more yet for
> 3P. And with larger drive count arrays the rebuild times approach a
> week. Whose users can go a week with degraded performance? This is
> simply unreasonable, at best. I say it's completely unacceptable.
>
> With these gargantuan drives coming soon, the probability of multiple
> UREs during rebuild are pretty high. Continuing to use ever more
> complex parity RAID schemes simply increases rebuild time further. The
> longer the rebuild, the more likely a subsequent drive failure due to
> heat buildup, vibration, etc. Thus, in our maniacal efforts to mitigate
> one failure mode we're increasing the probability of another. TANSTAFL.
> Worse yet, RAID10 isn't going to survive because UREs on a single drive
> are increasingly likely with these larger drives, and one URE during
> rebuild destroys the array.
>
I don't think the chance of hitting a URE during rebuild is dependent
on the rebuild time - merely on the amount of data read during rebuild.
URE rates are "per byte read" rather than "per unit time", are they not?
I think you are overestimating the rebuild times a bit, but there is no
arguing that rebuild on parity raids is a lot more work (for the cpu,
the IO system, and the disks) than for mirror raids.
> I think people are going to have to come to grips with using more and
> more drives simply to brace the legs holding up their arrays; comes to
> grips with these insane rebuild times; or bite the bullet they so
> steadfastly avoided with RAID10. Lots more spindles solves problems,
> but at a greater cost--again, no free lunch.
>
> What I envision is an array type, something similar to RAID 51, i.e.
> striped parity over mirror pairs. In the case of Linux, this would need
> to be a new distinct md/RAID level, as both the RAID5 and RAID1 code
> would need enhancement before being meshed together into this new level[1].
Shouldn't we be talking about RAID 15 here, rather than RAID 51 ? I
interpret "RAID 15" to be like "RAID 10" - a raid5 set of raid1 mirrors,
while "RAID 51" would be a raid1 mirror of raid5 sets. I am certain
that you mean a raid5 set of raid1 pairs - I just think you've got the
name wrong.
>
> Potential Advantages:
>
> 1. Only +1 disk capacity overhead vs RAID 10, regardless of drive count
+2 disks (the raid5 parity "disk" is a raid1 pair)
> 2. Rebuild time is the same as RAID 10, unless a mirror pair is lost
> 3. Parity is only used during rebuild if/when a URE occurs, unless ^
> 4. Single drive failure doesn't degrade the parity array, multiple
> failures in different mirrors doesn't degrade the parity array
> 5. Can sustain a minimum of 3 simultaneous drive failures--both drives
> in one mirror and one drive in another mirror
> 6. Can lose a maximum of 1/2 of the drives plus 1 drive--one more than
> RAID 10. Can lose half the drives and still not degrade parity,
> if no two comprise one mirror
> 7. Similar or possibly better read throughput vs triple parity RAID
> 8. Superior write performance with drives down
> 9. Vastly superior rebuild performance, as rebuilds will rarely, if
> ever, involve parity
>
> Potential Disadvantages:
>
> 1. +1 disk overhead vs RAID 10, many more than 2/3P w/large arrays
> 2. Read-modify-write penalty vs RAID 10
> 3. Slower write throughput vs triple parity RAID due to spindle deficit
> 4. Development effort
> 5. ??
>
>
> [1] The RAID1/5 code would need to be patched to properly handle a URE
> encountered by the RAID1 code during rebuild. There are surely other
> modifications and/or optimizations that would be needed. For large
> sequential reads, more deterministic read interleaving between mirror
> pairs would be a good candidate I think. IIUC the RAID1 driver does
> read interleaving on a per thread basis or some such, which I don't
> believe is going to work for this "RAID 51" scenario, at least not for
> single streaming reads. If this can be done well, we double the read
> performance of RAID5, and thus we don't completely "waste" all the extra
> disks vs big_parity schemes.
>
> This proposed "RAID level 51" should have drastically lower rebuild
> times vs traditional striped parity, should not suffer read/write
> performance degradation with most disk failure scenarios, and with a
> read interleaving optimization may have significantly greater streaming
> read throughput as well.
>
> This is far from a perfect solution and I am certainly not promoting it
> as such. But I think it does have some serious advantages over
> traditional striped parity schemes, and at minimum is worth discussion
> as a counterpoint of sorts.
>
I don't see that there need to be any changes to the existing md code
to make raid15 work - it is merely a raid 5 made from a set of raid1
pairs. I can see that improved threading and interleaving could be a
benefit here - but that's the case in general for md raid, and it is
something that the developers are already working on (I haven't followed
the details, but the topic comes up regularly on the list here).
So as far as I can see, you've got raid15 support already - if that's
what suits your needs, use it. Future improvements to the md code are
only needed to make it faster.
Of course, there is scope for making specific raid15 support in md along
the lines of the raid10 code - raid15,f2 would have the same speed
advantages over "normal" raid1+5 as raid10,f2 has over raid1+0. Whether
it is worth the effort implementing it is a different matter.
I can see plenty of reasons why raid15 might be a good idea, and even
raid16 for 5 disk redundancy, compared to multi-parity sets. However,
it costs a lot in disk space. For example, with 20 disks at 1 TB each,
you can have:
raid5 = 19TB, 1 disk redundancy
raid6 = 18TB, 2 disk redundancy
raid6.3 = 17TB, 3 disk redundancy
raid6.4 = 16TB, 4 disk redundancy
raid6.5 = 15TB, 5 disk redundancy
raid10 = 10TB, 1 disk redundancy
raid15 = 8TB, 3 disk redundancy
raid16 = 6TB, 5 disk redundancy
That's a very significant difference.
Implementing 3+ parity does not stop people using raid15, or similar
schemes - it just adds more choice to let people optimise according to
their needs.
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-21 9:07 ` David Brown
@ 2013-11-21 9:54 ` Adam Goryachev
2013-11-21 10:32 ` David Brown
2013-11-22 8:12 ` Russell Coker
2013-11-22 8:13 ` Stan Hoeppner
2013-11-22 8:38 ` Stan Hoeppner
2 siblings, 2 replies; 104+ messages in thread
From: Adam Goryachev @ 2013-11-21 9:54 UTC (permalink / raw)
To: David Brown, stan, James Plank, Ric Wheeler
Cc: Andrea Mazzoleni, H. Peter Anvin, linux-raid, linux-btrfs,
David Smith
On 21/11/13 20:07, David Brown wrote:
> I can see plenty of reasons why raid15 might be a good idea, and even
> raid16 for 5 disk redundancy, compared to multi-parity sets. However,
> it costs a lot in disk space. For example, with 20 disks at 1 TB each,
> you can have:
>
> raid5 = 19TB, 1 disk redundancy
> raid6 = 18TB, 2 disk redundancy
> raid6.3 = 17TB, 3 disk redundancy
> raid6.4 = 16TB, 4 disk redundancy
> raid6.5 = 15TB, 5 disk redundancy
>
> raid10 = 10TB, 1 disk redundancy
> raid15 = 8TB, 3 disk redundancy
> raid16 = 6TB, 5 disk redundancy
>
>
> That's a very significant difference.
>
> Implementing 3+ parity does not stop people using raid15, or similar
> schemes - it just adds more choice to let people optimise according to
> their needs.
BTW, as far as strange RAID type options to try and get around problems
with failed disks, before I learned about timeout mismatches, I was
pretty worried when my 5 disk RAID5 kept falling apart and losing a
random member; re-adding the failed disk would then work perfectly. To
help me feel better about this, I used 5 x 500GB drives in RAID5 and
then used the RAID5 + 1 x 2TB drive in RAID1, meaning I could afford to
lose any two disks without losing data. Of course, now I know RAID6
might have been a better choice, or even simply 2 x 2TB drives in RAID1 :)
In any case, I'm not sure I understand the concern with RAID 7.X (as it
is being called, where X > 2). Certainly you will need to make 1
computation for each stripe being written, for each value of X, so RAID
7.5 with 5 disk redundancy means 5 calculations for each stripe being
written. However, given that drives are getting bigger every year, did
we forget that we are also getting faster CPUs and more cores in a
single "CPU package"?
On a pure storage server, the CPU would normally have nothing to do
except a little interrupt handling; it is just shuffling bytes around.
Of course, if you need RAID7.5 then you probably have a dedicated
storage server, so I don't see the problem with using the CPU to do all
the calculations.
Of course, if you are asking about carbon emissions, and cooling costs
in the data center, this could (on a global scale) have a significant
impact, so maybe it is a bad idea after all :)
Regards,
Adam
--
Adam Goryachev
Website Managers
www.websitemanagers.com.au
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-20 21:59 ` Piergiorgio Sartor
@ 2013-11-21 10:13 ` David Brown
2013-11-21 17:37 ` Goffredo Baroncelli
2013-11-21 20:05 ` Piergiorgio Sartor
0 siblings, 2 replies; 104+ messages in thread
From: David Brown @ 2013-11-21 10:13 UTC (permalink / raw)
To: Piergiorgio Sartor
Cc: Andrea Mazzoleni, linux-raid, linux-btrfs, hpa, creamyfish
On 20/11/13 22:59, Piergiorgio Sartor wrote:
> On Wed, Nov 20, 2013 at 11:44:39AM +0100, David Brown wrote:
> [...]
>>> In RAID-6 (as per raid6check) there is an easy way
>>> to verify where an HDD has incorrect data.
>>>
>>
>> I think the way to do that is just to generate the parity blocks from
>> the data blocks, and compare them to the existing parity blocks.
>
> Uhm, the generic RS decoder should try all
> the possible combination of erasure and so
> detect the error.
> This is unfeasible already with 3 parities,
> so there are faster algorithms, I believe:
>
> Peterson–Gorenstein–Zierler algorithm
> Berlekamp–Massey algorithm
>
> Nevertheless, I do not know too much about
> those, so I cannot state if they apply to
> the Cauchy matrix as explained here.
>
> bye,
>
Ah, you are trying to find which disk has incorrect data so that you can
change just that one disk? There are dangers with that...
<http://neil.brown.name/blog/20100211050355>
If you disagree with this blog post (and I urge you to read it in full
first), then this is how I would do a "smart" stripe recovery:
First calculate the parities from the data blocks, and compare these
with the existing parity blocks.
If they all match, the stripe is consistent.
Normal (detectable) disk errors and unrecoverable read errors get
flagged by the disk and the IO system, and you /know/ there is a problem
with that block. Whether it is a data block or a parity block, you
re-generate the correct data and store it - that's what your raid is for.
If you have no detected read errors, and there is one parity
inconsistency, then /probably/ that block has had an undetected read
error, or it simply has not been written completely before a crash.
Either way, just re-write the correct parity.
If there are two or more parity inconsistencies, but not all parities
are in error, then you either have multiple disk or block failures, or
you have a partly-written stripe. Any attempts at "smart" correction
will almost certainly be worse than just re-writing the new parities and
hoping that the filesystem's journal works.
If all the parities are inconsistent, then the "smart" thing is to look
for a single incorrect disk block. Just step through the blocks one by
one - assume that block is wrong and replace it (in temporary memory,
not on disk!) with a recovered version from the other data blocks and
the parities (only the first parity is needed). Re-calculate the other
parities and compare. If the other parities now match, then you have
found a single inconsistent data block. It /may/ be a good idea to
re-write this - or maybe not (see the blog post linked above).
If you don't find any single data blocks that can be "corrected" in this
way, then re-writing the parity blocks to match the disk data is
probably the least harmful fix.
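In rough Python, the above looks something like this (a sketch only, not
md code: check_stripe and xor_blocks are names I made up, compute_parities
stands in for whatever parity routine the array actually uses, and parity
0 is assumed to be the plain XOR parity so that single-block recovery
stays cheap):

def xor_blocks(blocks, length):
    out = bytearray(length)
    for b in blocks:
        for i in range(length):
            out[i] ^= b[i]
    return bytes(out)

def check_stripe(data, stored_parity, compute_parities):
    """Return ('ok', None), ('rewrite-parity', None), ('data', index)
    or ('unknown', None) for one stripe of data and parity blocks."""
    calc = compute_parities(data)
    bad = [i for i, (c, s) in enumerate(zip(calc, stored_parity)) if c != s]
    if not bad:
        return ('ok', None)                    # stripe is consistent
    if len(bad) < len(stored_parity):
        # Some parities still match: partly-written stripe or multiple
        # failures - least harmful fix is to rewrite the bad parities.
        return ('rewrite-parity', None)
    blen = len(data[0])
    for i in range(len(data)):
        # Trial-recover block i (in memory only) from P and the others.
        others = [d for j, d in enumerate(data) if j != i]
        candidate = xor_blocks(others + [stored_parity[0]], blen)
        trial = list(data)
        trial[i] = candidate
        if compute_parities(trial) == list(stored_parity):
            return ('data', i)                 # single inconsistent block
    return ('unknown', None)                   # fall back: rewrite parities

Whether anything is actually written back is a separate decision, for the
reasons above.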
Remember, this is not a general error detection and correction scheme -
it is a system targeted for a particular type of use, with particular
patterns of failure and failure causes, and particular mechanisms on top
(journalled file systems) to consider.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-21 9:54 ` Adam Goryachev
@ 2013-11-21 10:32 ` David Brown
2013-11-22 8:12 ` Russell Coker
1 sibling, 0 replies; 104+ messages in thread
From: David Brown @ 2013-11-21 10:32 UTC (permalink / raw)
To: Adam Goryachev, stan, James Plank, Ric Wheeler
Cc: Andrea Mazzoleni, H. Peter Anvin, linux-raid, linux-btrfs,
David Smith
On 21/11/13 10:54, Adam Goryachev wrote:
> On 21/11/13 20:07, David Brown wrote:
>> I can see plenty of reasons why raid15 might be a good idea, and even
>> raid16 for 5 disk redundancy, compared to multi-parity sets. However,
>> it costs a lot in disk space. For example, with 20 disks at 1 TB each,
>> you can have:
>>
>> raid5 = 19TB, 1 disk redundancy
>> raid6 = 18TB, 2 disk redundancy
>> raid6.3 = 17TB, 3 disk redundancy
>> raid6.4 = 16TB, 4 disk redundancy
>> raid6.5 = 15TB, 5 disk redundancy
>>
>> raid10 = 10TB, 1 disk redundancy
>> raid15 = 8TB, 3 disk redundancy
>> raid16 = 6TB, 5 disk redundancy
>>
>>
>> That's a very significant difference.
>>
>> Implementing 3+ parity does not stop people using raid15, or similar
>> schemes - it just adds more choice to let people optimise according to
>> their needs.
> BTW, as far as strange RAID type options to try and get around problems
> with failed disks, before I learned about timeout mismatches, I was
> pretty worried when my 5 disk RAID5 kept falling apart and losing a
> random member, then adding the failed disk back would work perfectly. To
> help me feel better about this, I used 5 x 500GB drives in RAID5 and
> then used the RAID5 + 1 x 2TB drive in RAID1, meaning I could afford to
> lose any two disks without losing data. Of course, now I know RAID6
> might have been a better choice, or even simply 2 x 2TB drives in RAID1 :)
>
> In any case, I'm not sure I understand the concern with RAID 7.X (as it
> is being called, where X > 2). Certainly you will need one parity
> computation per stripe written for each parity level, so RAID 7.5 with
> 5-disk redundancy means 5 calculations for each stripe being written.
> However, given that drives are getting bigger every year, did we forget
> that we are also getting faster CPUs and more cores in a single
> "CPU package"?
>
This is all true. And md code is getting better at using more cores
under more circumstances, making the parity calculations more efficient.
The speed concern (which was Stan's, rather than mine) is more about
recovery and rebuild. If you have a layered raid with raid1 pairs at
the bottom level, then recovery and rebuild (from a single failure) is
just a straight copy from one disk to another - you don't get faster
than that. If you have a 20 + 3 parity raid, then rebuilding requires
reading a stripe from 20 disks and writing to 1 disk - that's far more
effort and is likely to take more time unless your IO system can handle
full bandwidth of all the disks simultaneously.
Similarly, performance of the array while rebuilding or degraded is much
worse for parity raids than for raids on top of raid1 pairs.
How that matters to you, and how it balances with the space costs, is up
to you and your application.
> On a pure storage server, the CPU would normally have nothing to do,
> except a little interrupt handling, it is just shuffling bytes around.
> Of course, if you need RAID7.5 then you probably have a dedicated
> storage server, so I don't see the problem with using the CPU to do all
> the calculations.
>
> Of course, if you are asking about carbon emissions, and cooling costs
> in the data center, this could (on a global scale) have a significant
> impact, so maybe it is a bad idea after all :)
>
> Regards,
> Adam
>
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-21 10:13 ` David Brown
@ 2013-11-21 17:37 ` Goffredo Baroncelli
2013-11-21 20:05 ` Piergiorgio Sartor
1 sibling, 0 replies; 104+ messages in thread
From: Goffredo Baroncelli @ 2013-11-21 17:37 UTC (permalink / raw)
To: David Brown
Cc: Piergiorgio Sartor, Andrea Mazzoleni, linux-raid, linux-btrfs,
hpa, creamyfish
On 2013-11-21 11:13, David Brown wrote:
> On 20/11/13 22:59, Piergiorgio Sartor wrote:
>> On Wed, Nov 20, 2013 at 11:44:39AM +0100, David Brown wrote:
>> [...]
>>>> In RAID-6 (as per raid6check) there is an easy way
>>>> to verify where an HDD has incorrect data.
>>>>
>>>
>>> I think the way to do that is just to generate the parity blocks from
>>> the data blocks, and compare them to the existing parity blocks.
>>
>> Uhm, the generic RS decoder should try all
>> the possible combination of erasure and so
>> detect the error.
>> This is unfeasible already with 3 parities,
>> so there are faster algorithms, I believe:
>>
>> Peterson–Gorenstein–Zierler algorithm
>> Berlekamp–Massey algorithm
>>
>> Nevertheless, I do not know too much about
>> those, so I cannot state if they apply to
>> the Cauchy matrix as explained here.
>>
>> bye,
>>
>
> Ah, you are trying to find which disk has incorrect data so that you can
> change just that one disk? There are dangers with that...
>
> <http://neil.brown.name/blog/20100211050355>
>
> If you disagree with this blog post (and I urge you to read it in full
> first), then this is how I would do a "smart" stripe recovery:
>
>
> First calculate the parities from the data blocks, and compare these
> with the existing parity blocks.
>
> If they all match, the stripe is consistent.
>
> Normal (detectable) disk errors and unrecoverable read errors get
> flagged by the disk and the IO system, and you /know/ there is a problem
> with that block. Whether it is a data block or a parity block, you
> re-generate the correct data and store it - that's what your raid is for.
>
> If you have no detected read errors, and there is one parity
> inconsistency, then /probably/ that block has had an undetected read
> error, or it simply has not been written completely before a crash.
> Either way, just re-write the correct parity.
>
> If there are two or more parity inconsistencies, but not all parities
> are in error, then you either have multiple disk or block failures, or
> you have a partly-written stripe. Any attempts at "smart" correction
> will almost certainly be worse than just re-writing the new parities and
> hoping that the filesystem's journal works.
>
> If all the parities are inconsistent, then the "smart" thing is to look
> for a single incorrect disk block. Just step through the blocks one by
> one - assume that block is wrong and replace it (in temporary memory,
> not on disk!) with a recovered version from the other data blocks and
> the parities (only the first parity is needed). Re-calculate the other
> parities and compare. If the other parities now match, then you have
> found a single inconsistent data block. It /may/ be a good idea to
> re-write this - or maybe not (see the blog post linked above).
>
> If you don't find any single data blocks that can be "corrected" in this
> way, then re-writing the parity blocks to match the disk data is
> probably the least harmful fix.
It has to be pointed out that all the major filesystems either have
integrated, or are trying to integrate, some sort of checksumming to
avoid guessing whether the data or the parity is wrong:
- btrfs is fully checksummed
- zfs is fully checksummed
- ext4 and xfs are adding checksums for the metadata [1][2]
- ReFS (Windows) protects the metadata and (optionally) the data with
checksums [3]
We are talking about
[1] https://ext4.wiki.kernel.org/index.php/Ext4_Metadata_Checksums
[2] http://xfs.org/images/d/d1/Xfs-scalability-lca2012.pdf
[3] http://en.wikipedia.org/wiki/ReFS
>
>
> Remember, this is not a general error detection and correction scheme -
> it is a system targeted for a particular type of use, with particular
> patterns of failure and failure causes, and particular mechanisms on top
> (journalled file systems) to consider.
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-21 1:28 ` Stan Hoeppner
` (2 preceding siblings ...)
2013-11-21 9:07 ` David Brown
@ 2013-11-21 19:56 ` Piergiorgio Sartor
3 siblings, 0 replies; 104+ messages in thread
From: Piergiorgio Sartor @ 2013-11-21 19:56 UTC (permalink / raw)
To: Stan Hoeppner
Cc: James Plank, Ric Wheeler, Andrea Mazzoleni, H. Peter Anvin,
linux-raid, linux-btrfs, David Brown, David Smith
On Wed, Nov 20, 2013 at 07:28:37PM -0600, Stan Hoeppner wrote:
[...]
> It's always perilous to follow a Ph.D., so I guess I'm feeling suicidal
> today. ;)
>
> I'm not attempting to marginalize Andrea's work here, but I can't help
> but ponder what the real value of triple parity RAID is, or quad, or
> beyond. Some time ago parity RAID's primary mission ceased to be
Hi Stan,
my opinion is that you have to think
in terms of storage devices which are
not always available.
Those are not simply directly connected
HDDs; they could be something more exotic.
The example I consider is p2p network
storage, where the nodes are not very
reliable.
I guess there could be more such cases.
bye,
--
piergiorgio
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-21 10:13 ` David Brown
2013-11-21 17:37 ` Goffredo Baroncelli
@ 2013-11-21 20:05 ` Piergiorgio Sartor
2013-11-21 20:31 ` David Brown
1 sibling, 1 reply; 104+ messages in thread
From: Piergiorgio Sartor @ 2013-11-21 20:05 UTC (permalink / raw)
To: David Brown
Cc: Piergiorgio Sartor, Andrea Mazzoleni, linux-raid, linux-btrfs,
hpa, creamyfish
On Thu, Nov 21, 2013 at 11:13:29AM +0100, David Brown wrote:
[...]
> Ah, you are trying to find which disk has incorrect data so that you can
> change just that one disk? There are dangers with that...
Hi David,
> <http://neil.brown.name/blog/20100211050355>
I think we already did the exercise, here :-)
> If you disagree with this blog post (and I urge you to read it in full
We discussed the topic (with Neil) and, if I
recall correctly, he is against having an
_automatic_ error detection and correction _in_
kernel.
I fully agree with that: user space is better
and it should not be automatic, but it should
do things under user control.
The current "check" operetion is pretty poor.
It just reports how many mismatches, it does
not even report where in the array.
The first step, independent from how many
parities one has, would be to tell the user
where the mismatches occurred, so it would
be possible to check the FS at that position.
Having a multi parity RAID allows to check
even which disk.
This would provide the user with a more
comprehensive (I forgot the spelling)
information.
Of course, since we are there, we can
also give the option to fix it.
This would be much likely a "fsck".
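
For the double-parity case this is the usual syndrome trick (roughly what
raid6check relies on, as far as I understand it): if exactly one data
block in a stripe is wrong, dividing the Q syndrome by the P syndrome in
GF(2^8) points at the bad disk. A rough Python sketch, assuming only the
standard RAID-6 conventions (P = XOR, Q built with generator {02} and
polynomial 0x11d); locate_bad_data_disk is a name I made up:

# Build GF(2^8) log/antilog tables for polynomial 0x11d, generator {02}.
EXP, LOG = [0] * 512, [0] * 256
x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    x = (x << 1) ^ (0x11d if x & 0x80 else 0)
for i in range(255, 512):
    EXP[i] = EXP[i - 255]

def gf_mul(a, b):
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]

def locate_bad_data_disk(data, p, q):
    """data: list of per-data-disk byte blocks for one stripe; p, q: the
    stored P and Q parity blocks. Returns the index of the single corrupt
    data disk, or None if the stripe is clean, a parity block is the odd
    one out, or more than one block appears to be corrupt."""
    n, blen = len(data), len(data[0])
    candidates = set()
    for off in range(blen):
        pcalc, qcalc = 0, 0
        for d in range(n):
            byte = data[d][off]
            pcalc ^= byte
            qcalc ^= gf_mul(EXP[d], byte)      # g^d * D_d, g = {02}
        pe = pcalc ^ p[off]                    # P syndrome for this byte
        qe = qcalc ^ q[off]                    # Q syndrome for this byte
        if pe == 0 and qe == 0:
            continue                           # this byte column is fine
        if pe == 0 or qe == 0:
            return None                        # P or Q itself looks bad
        z = (LOG[qe] - LOG[pe]) % 255          # qe/pe = g^z => disk index
        if z >= n:
            return None                        # nonsense index: give up
        candidates.add(z)
        if len(candidates) > 1:
            return None                        # more than one disk implicated
    return candidates.pop() if candidates else None

With more parities one should be able to implicate more than one disk at
a time, but that is where the heavier decoding algorithms mentioned
earlier would come in.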
> first), then this is how I would do a "smart" stripe recovery:
>
> First calculate the parities from the data blocks, and compare these
> with the existing parity blocks.
>
> If they all match, the stripe is consistent.
>
> Normal (detectable) disk errors and unrecoverable read errors get
> flagged by the disk and the IO system, and you /know/ there is a problem
> with that block. Whether it is a data block or a parity block, you
> re-generate the correct data and store it - that's what your raid is for.
That's not always the case, otherwise
having the mismatch count would be useless.
The issue is that errors appear, whatever
the reason, without being reported by the
underlying hardware.
> If you have no detected read errors, and there is one parity
> inconsistency, then /probably/ that block has had an undetected read
> error, or it simply has not been written completely before a crash.
> Either way, just re-write the correct parity.
Why re-write the parity if I can get
the correct data there?
If I can be sure that one data block is
incorrect and I can re-create it properly,
that's the thing to do.
> Remember, this is not a general error detection and correction scheme -
It is not, but it could be. For free.
bye,
--
piergiorgio
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-21 20:05 ` Piergiorgio Sartor
@ 2013-11-21 20:31 ` David Brown
2013-11-21 20:52 ` Piergiorgio Sartor
2013-11-26 18:10 ` joystick
0 siblings, 2 replies; 104+ messages in thread
From: David Brown @ 2013-11-21 20:31 UTC (permalink / raw)
To: Piergiorgio Sartor
Cc: Andrea Mazzoleni, linux-raid, linux-btrfs, hpa, creamyfish
On 21/11/13 21:05, Piergiorgio Sartor wrote:
> On Thu, Nov 21, 2013 at 11:13:29AM +0100, David Brown wrote:
> [...]
>> Ah, you are trying to find which disk has incorrect data so that you can
>> change just that one disk? There are dangers with that...
>
> Hi David,
>
>> <http://neil.brown.name/blog/20100211050355>
>
> I think we already did the exercise, here :-)
>
>> If you disagree with this blog post (and I urge you to read it in full
>
> We discussed the topic (with Neil) and, if I
> recall correctly, he is against having an
> _automatic_ error detection and correction _in_
> kernel.
> I fully agree with that: user space is better
> and it should not be automatic, but it should
> do things under user control.
>
OK.
> The current "check" operetion is pretty poor.
> It just reports how many mismatches, it does
> not even report where in the array.
> The first step, independent from how many
> parities one has, would be to tell the user
> where the mismatches occurred, so it would
> be possible to check the FS at that position.
Certainly it would be good to give the user more information. If you
can tell the user where the errors are, and what the likely failed block
is, then that would be very useful. If you can tell where it is in the
filesystem (such as which file, if any, owns the blocks in question)
then that would be even better.
> Having a multi-parity RAID even allows
> checking which disk is affected.
> This would provide the user with more
> comprehensive information.
>
> Of course, since we are there, we can
> also give the option to fix it.
> This would be much like a "fsck".
If this can all be done to give the user an informed choice, then it
sounds good.
One issue here is whether the check should be done with the filesystem
mounted and in use, or only off-line. If it is off-line then it will
mean a long down-time while the array is checked - but if it is online,
then there is the risk of confusing the filesystem and caches by
changing the data.
>
>> first), then this is how I would do a "smart" stripe recovery:
>>
>> First calculate the parities from the data blocks, and compare these
>> with the existing parity blocks.
>>
>> If they all match, the stripe is consistent.
>>
>> Normal (detectable) disk errors and unrecoverable read errors get
>> flagged by the disk and the IO system, and you /know/ there is a problem
>> with that block. Whether it is a data block or a parity block, you
>> re-generate the correct data and store it - that's what your raid is for.
>
> That's not always the case, otherwise
> having the mismatch count would be useless.
> The issue is that errors appear, whatever
> the reason, without being reported by the
> underlying hardware.
>
(I know you know how this works, so I am not trying to be patronising
with this explanation - I just think we have slightly misunderstood what
the other is saying, so spelling it out will hopefully make it clearer.)
Most disk errors /are/ detectable, and are reported by the underlying
hardware - small surface errors are corrected by the disk's own error
checking and correcting mechanisms, and larger errors are usually
detected. It is (or should be!) very rare that a read error goes
undetected without there being a major problem with the disk controller.
And if the error is detected, then the normal raid processing kicks in
as there is no doubt about which block has problems.
>> If you have no detected read errors, and there is one parity
>> inconsistency, then /probably/ that block has had an undetected read
>> error, or it simply has not been written completely before a crash.
>> Either way, just re-write the correct parity.
>
> Why re-write the parity if I can get
> the correct data there?
> If I can be sure that one data block is
> incorrect and I can re-create it properly,
> that's the thing to do.
If you can be /sure/ about which data block is incorrect, then I agree -
but you can't be /entirely/ sure. But I agree that you can make a good
enough guess to recommend a fix to the user - as long as it is not
automatic.
>
>> Remember, this is not a general error detection and correction scheme -
>
> It is not, but it could be. For free.
>
For most ECC schemes, you know that all your blocks are set
synchronously - so any block that does not fit in is an error. With
raid, it could also be that a stripe is only partly written - you can
have two different valid sets of data mixed to give an inconsistent
stripe, without any good way of telling what consistent data is the best
choice.
Perhaps a checking tool can take advantage of a write-intent bitmap (if
there is one) so that it knows if an inconsistent stripe is partly
updated or the result of a disk error.
mvh.,
David
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-21 20:31 ` David Brown
@ 2013-11-21 20:52 ` Piergiorgio Sartor
2013-11-22 0:32 ` David Brown
2013-11-26 18:10 ` joystick
1 sibling, 1 reply; 104+ messages in thread
From: Piergiorgio Sartor @ 2013-11-21 20:52 UTC (permalink / raw)
To: David Brown
Cc: Piergiorgio Sartor, Andrea Mazzoleni, linux-raid, linux-btrfs,
hpa, creamyfish
Hi David,
On Thu, Nov 21, 2013 at 09:31:46PM +0100, David Brown wrote:
[...]
> If this can all be done to give the user an informed choice, then it
> sounds good.
that would be my target.
To _offer_ more options to the (advanced) user.
It _must_ always be under user control.
> One issue here is whether the check should be done with the filesystem
> mounted and in use, or only off-line. If it is off-line then it will
> mean a long down-time while the array is checked - but if it is online,
> then there is the risk of confusing the filesystem and caches by
> changing the data.
Currently, "raid6check" can work with FS mounted.
I got the suggestion from Neil (of course).
It is possible to lock one stripe and check it.
This should be, at any given time, consistent
(that is, the parity should always match the data).
If an error is found, it is reported.
Again, the user can decide to fix it or not,
considering all the FS consequences and so on.
> Most disk errors /are/ detectable, and are reported by the underlying
> hardware - small surface errors are corrected by the disk's own error
> checking and correcting mechanisms, and larger errors are usually
> detected. It is (or should be!) very rare that a read error goes
> undetected without there being a major problem with the disk controller.
> And if the error is detected, then the normal raid processing kicks in
> as there is no doubt about which block has problems.
That's clear. That case is an "erasure" (I think)
and it is perfectly in line with the usual operation.
I'm not trying to replace this mechanism.
> If you can be /sure/ about which data block is incorrect, then I agree -
> but you can't be /entirely/ sure. But I agree that you can make a good
> enough guess to recommend a fix to the user - as long as it is not
> automatic.
One typical case is when many errors are
found, belonging to the same disk.
This case clearly shows the disk is to be
replaced or the interface checked...
But, again, the user is the master, not the
machine... :-)
> For most ECC schemes, you know that all your blocks are set
> synchronously - so any block that does not fit in is an error. With
> raid, it could also be that a stripe is only partly written - you can
Could it be?
I would consider this an error.
The stripe must always be consistent; there
should be a transactional mechanism to make
sure that, if read back, the data always
matches the parity.
When I write "read back" I mean from wherever
the data comes: physical disk or cache.
Otherwise, the check must run exclusively on
the array (no mounted FS, nothing else
running on it).
> have two different valid sets of data mixed to give an inconsistent
> stripe, without any good way of telling what consistent data is the best
> choice.
>
> Perhaps a checking tool can take advantage of a write-intent bitmap (if
> there is one) so that it knows if an inconsistent stripe is partly
> updated or the result of a disk error.
Of course, this is an option, which should be
taken into consideration.
Any improvement idea is welcome!!!
bye,
--
piergiorgio
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-21 7:05 ` John Williams
@ 2013-11-21 22:57 ` Stan Hoeppner
2013-11-21 23:38 ` John Williams
2013-11-22 23:07 ` NeilBrown
0 siblings, 2 replies; 104+ messages in thread
From: Stan Hoeppner @ 2013-11-21 22:57 UTC (permalink / raw)
To: John Williams
Cc: James Plank, Ric Wheeler, Andrea Mazzoleni, H. Peter Anvin,
Linux RAID Mailing List, Btrfs BTRFS, David Brown, David Smith
On 11/21/2013 1:05 AM, John Williams wrote:
> On Wed, Nov 20, 2013 at 10:52 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
>> On 11/20/2013 8:46 PM, John Williams wrote:
>>> For myself or any machines I managed for work that do not need high
>>> IOPS, I would definitely choose triple- or quad-parity over RAID 51 or
>>> similar schemes with arrays of 16 - 32 drives.
>>
>> You must see a week long rebuild as acceptable...
>
> It would not be a problem if it did take that long, since I would have
> extra parity units as backup in case of a failure during a rebuild.
>
> But of course it would not take that long. Take, for example, a 24 x
> 3TB triple-parity array (21+3) that has had two drive failures
> (perhaps the rebuild started with one failure, but there was soon
> another failure). I would expect the rebuild to take about a day.
You're looking at today. We're discussing tomorrow's needs. Today's
6TB 3.5" drives have sustained average throughput of ~175MB/s.
Tomorrow's 20TB drives will be lucky to do 300MB/s. As I said
previously, at that rate a straight disk-disk copy of a 20TB drive takes
18.6 hours. This is what you get with RAID1/10/51. In the real world,
rebuilding a failed drive in a 3P array of say 8 of these disks will
likely take at least 3 times as long, 2 days 6 hours minimum, probably
more. This may be perfectly acceptable to some, but probably not to all.
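The arithmetic behind those figures, for anyone who wants to play with it
(the 300MB/s rate and the 3x parity-rebuild penalty are the assumptions
above, not measurements; rebuild_hours is just a throwaway helper):

def rebuild_hours(capacity_tb, mb_per_s, penalty=1.0):
    """Hours to stream capacity_tb terabytes at mb_per_s MB/s, scaled by
    an assumed slowdown factor for parity rebuilds."""
    return penalty * (capacity_tb * 1e12) / (mb_per_s * 1e6) / 3600

print(rebuild_hours(20, 300))       # ~18.5h: straight disk-to-disk copy
print(rebuild_hours(20, 300, 3.0))  # ~55.6h with an assumed 3x penalty
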
>>> on a subject Adam Leventhal has already
>>> covered in detail in an article "Triple-Parity RAID and Beyond" which
>>> seems to match the subject of this thread quite nicely:
>>>
>>> http://queue.acm.org/detail.cfm?id=1670144
>>
>> Mr. Leventhal did not address the overwhelming problem we face, which is
>> (multiple) parity array reconstruction time. He assumes the time to
>> simply 'populate' one drive at its max throughput is the total
>> reconstruction time for the array.
>
> Since Adam wrote the code for RAID-Z3 for ZFS, I'm sure he is aware of
> the time to restore data to failed drives. I do not see any flaw in
> his analysis related to the time needed to restore data to failed
> drives.
He wrote that article in late 2009. It seems pretty clear he wasn't
looking 10 years forward to 20TB drives, where the minimum mirror
rebuild time will be ~18 hours, and parity rebuild will be much greater.
--
Stan
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-21 22:57 ` Stan Hoeppner
@ 2013-11-21 23:38 ` John Williams
2013-11-22 9:35 ` Stan Hoeppner
2013-11-22 23:07 ` NeilBrown
1 sibling, 1 reply; 104+ messages in thread
From: John Williams @ 2013-11-21 23:38 UTC (permalink / raw)
To: stan
Cc: James Plank, Ric Wheeler, Andrea Mazzoleni, H. Peter Anvin,
Linux RAID Mailing List, Btrfs BTRFS, David Brown, David Smith
On Thu, Nov 21, 2013 at 2:57 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> He wrote that article in late 2009. It seems pretty clear he wasn't
> looking 10 years forward to 20TB drives, where the minimum mirror
> rebuild time will be ~18 hours, and parity rebuild will be much greater.
Actually, it is completely obvious that he WAS looking ten years
ahead, seeing as several of his graphs have time scales going to
2009+10 = 2019.
And he specifically mentions longer rebuild times as one of the
reasons why higher parity RAIDs are needed.
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-21 8:08 ` joystick
@ 2013-11-22 0:30 ` Stan Hoeppner
2013-11-22 0:33 ` H. Peter Anvin
2013-11-22 0:45 ` David Brown
0 siblings, 2 replies; 104+ messages in thread
From: Stan Hoeppner @ 2013-11-22 0:30 UTC (permalink / raw)
To: joystick
Cc: James Plank, Ric Wheeler, Andrea Mazzoleni, H. Peter Anvin,
linux-raid, linux-btrfs, David Brown, David Smith
On 11/21/2013 2:08 AM, joystick wrote:
> On 21/11/2013 02:28, Stan Hoeppner wrote:
...
>> WRT rebuild times, once drives hit 20TB we're looking at 18 hours just
>> to mirror a drive at full streaming bandwidth, assuming 300MB/s
>> average--and that is probably being kind to the drive makers. With 6 or
>> 8 of these drives, I'd guess a typical md/RAID6 rebuild will take at
>> minimum 72 hours or more, probably over 100, and probably more yet for
>> 3P. And with larger drive count arrays the rebuild times approach a
>> week. Whose users can go a week with degraded performance? This is
>> simply unreasonable, at best. I say it's completely unacceptable.
>>
>> With these gargantuan drives coming soon, the probability of multiple
>> UREs during rebuild are pretty high.
>
> No because if you are correct about the very high CPU overhead during
I made no such claim.
> rebuild (which I don't see so dramatic as Andrea claims 500MB/sec for
> triple-parity, probably parallelizable on multiple cores), the speed of
> rebuild decreases proportionally
The rebuild time of a parity array normally has little to do with CPU
overhead. The bulk of the elapsed time is due to:
1. The serial nature of the rebuild algorithm
2. The random IO pattern of the reads
3. The rotational latency of the drives
#3 is typically the largest portion of the elapsed time.
> and hence the stress and heating on the
> drives proportionally reduces, approximating that of normal operation.
> And how often have you seen a drive failure in a week during normal
> operation?
This depends greatly on one's normal operation. In general, for most
users of parity arrays, any full array operation such as a rebuild or
reshape is far more taxing on the drives, in both power draw and heat
dissipation, than 'normal' operation.
> But in reality, consider that a non-naive implementation of
> multiple-parity would probably use just the single parity during
> reconstruction if just one disk fails, using the multiple parities only
> to read the stripes which are unreadable at single parity. So the speed
> and time of reconstruction and performance penalty would be that of
> raid5 except in exceptional situations of multiple failures.
That may very well be, but it doesn't change #2,3 above.
>> What I envision is an array type, something similar to RAID 51, i.e.
>> striped parity over mirror pairs. ....
>
> I don't like your approach of raid 51: it has the write overhead of
> raid5, with the waste of space of raid1.
> So it cannot be used as neither a performance array nor a capacity array.
I don't like it either. It's a compromise. But as RAID1/10 will soon
be unusable due to URE probability during rebuild, I think it's a
relatively good compromise for some users, some workloads.
> In the scope of this discussion (we are talking about very large
> arrays),
Capacity, yes; drive count, no. Drive capacities are increasing at a
much faster rate than our need for storage space. As we move forward
the trend will be building larger capacity arrays with fewer disks.
> the waste of space of your solution, higher than 50%, will make
> your solution costing double the price.
This is the classic mirror vs parity argument. Using 1 more disk to add
parity to striped mirrors doesn't change it. "Waste" is in the eye of
the beholder. Anyone currently using RAID10 will have no problem
dedicating one more disk for uptime, protection.
> A competitor for the multiple-parity scheme might be raid65 or 66, but
> this is a so much dirtier approach than multiple parity if you think at
> the kind of rmw and overhead that will occur during normal operation.
Neither of those has any advantage over multi-parity. I suggested this
approach because it retains all of the advantages of RAID10 but one. We
sacrifice fast random write performance for protection against UREs, the
same reason behind 3P. That's what the single parity is for, and that
alone.
I suggest that anyone in the future needing fast random write IOPS is
going to move those workloads to SSD, which is steadily increasing in
capacity. And I suggest anyone building arrays with 10-20TB drives
isn't in need of fast random write IOPS. Whether this approach is
valuable to anyone depends on whether the remaining attributes of
RAID10, with the added URE protection, are worth the drive count.
Obviously proponents of traditional parity arrays will not think so.
Users of RAID10 may. Even if md never supports such a scheme, I bet
we'll see something similar to this in enterprise gear, where rebuilds
need to be 'fast' and performance degradation due to a downed drive is
not acceptable.
--
Stan
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-21 20:52 ` Piergiorgio Sartor
@ 2013-11-22 0:32 ` David Brown
2013-11-22 20:32 ` Piergiorgio Sartor
0 siblings, 1 reply; 104+ messages in thread
From: David Brown @ 2013-11-22 0:32 UTC (permalink / raw)
To: Piergiorgio Sartor
Cc: Andrea Mazzoleni, linux-raid, linux-btrfs, hpa, creamyfish
On 21/11/13 21:52, Piergiorgio Sartor wrote:
> Hi David,
>
> On Thu, Nov 21, 2013 at 09:31:46PM +0100, David Brown wrote:
> [...]
>> If this can all be done to give the user an informed choice, then it
>> sounds good.
>
> that would be my target.
> To _offer_ more options to the (advanced) user.
> It _must_ always be under user control.
>
>> One issue here is whether the check should be done with the filesystem
>> mounted and in use, or only off-line. If it is off-line then it will
>> mean a long down-time while the array is checked - but if it is online,
>> then there is the risk of confusing the filesystem and caches by
>> changing the data.
>
> Currently, "raid6check" can work with FS mounted.
> I got the suggestion from Neil (of course).
> It is possible to lock one stripe and check it.
> This should be, at any given time, consistent
> (that is, the parity should always match the data).
> If an error is found, it is reported.
> Again, the user can decide to fix it or not,
> considering all the FS consequences and so on.
>
If you can lock stripes, and make sure any old data from that stripe is
flushed from the caches (if you change it while locked), then that
sounds ideal.
>> Most disk errors /are/ detectable, and are reported by the underlying
>> hardware - small surface errors are corrected by the disk's own error
>> checking and correcting mechanisms, and larger errors are usually
>> detected. It is (or should be!) very rare that a read error goes
>> undetected without there being a major problem with the disk controller.
>> And if the error is detected, then the normal raid processing kicks in
>> as there is no doubt about which block has problems.
>
> That's clear. That case is an "erasure" (I think)
> and it is perfectly in line with the usual operation.
> I'm not trying to replace this mechanism.
>
>> If you can be /sure/ about which data block is incorrect, then I agree -
>> but you can't be /entirely/ sure. But I agree that you can make a good
>> enough guess to recommend a fix to the user - as long as it is not
>> automatic.
>
> One typical case is when many errors are
> found, belonging to the same disk.
> This case clearly shows the disk is to be
> replaced or the interface checked...
> But, again, the user is the master, not the
> machine... :-)
I don't know what sort of interface you have for the user, but I guess
that means you'll have to collect a number of failures before showing
them so that the user can see the correlation on disk number.
>
>> For most ECC schemes, you know that all your blocks are set
>> synchronously - so any block that does not fit in, is an error. With
>> raid, it could also be that a stripe is only partly written - you can
>
> Could it be?
> I would consider this an error.
It could occur as the result of a failure of some sort (kernel crash,
power failure, temporary disk problem, etc.). More generally, md raid
doesn't have to be on local physical disks - maybe one of the "disks" is
an iSCSI drive or something else over a network that could have failures
or delays. I haven't thought through all cases here - I am just
throwing them out as possibilities that might cause trouble.
> The stripe must always be consistent; there
> should be a transactional mechanism to make
> sure that, if read back, the data always
> matches the parity.
> When I write "read back" I mean from wherever
> the data comes: physical disk or cache.
> Otherwise, the check must run exclusively on
> the array (no mounted FS, nothing else
> running on it).
>
>> have two different valid sets of data mixed to give an inconsistent
>> stripe, without any good way of telling what consistent data is the best
>> choice.
>>
>> Perhaps a checking tool can take advantage of a write-intent bitmap (if
>> there is one) so that it knows if an inconsistent stripe is partly
>> updated or the result of a disk error.
>
> Of course, this is an option, which should be
> taken into consideration.
>
> Any improvement idea is welcome!!!
>
> bye,
>
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-22 0:30 ` Stan Hoeppner
@ 2013-11-22 0:33 ` H. Peter Anvin
2013-11-22 0:45 ` David Brown
1 sibling, 0 replies; 104+ messages in thread
From: H. Peter Anvin @ 2013-11-22 0:33 UTC (permalink / raw)
To: stan, joystick
Cc: James Plank, Ric Wheeler, Andrea Mazzoleni, linux-raid,
linux-btrfs, David Brown, David Smith
On 11/21/2013 04:30 PM, Stan Hoeppner wrote:
>
> The rebuild time of a parity array normally has little to do with CPU
> overhead.>
Unless you have to fall back to table driven code.
Anyway, this looks like a great concept. Now we just need to implement
it ;)
-hpa
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-22 0:30 ` Stan Hoeppner
2013-11-22 0:33 ` H. Peter Anvin
@ 2013-11-22 0:45 ` David Brown
1 sibling, 0 replies; 104+ messages in thread
From: David Brown @ 2013-11-22 0:45 UTC (permalink / raw)
To: stan, joystick
Cc: James Plank, Ric Wheeler, Andrea Mazzoleni, H. Peter Anvin,
linux-raid, linux-btrfs, David Smith
On 22/11/13 01:30, Stan Hoeppner wrote:
> I don't like it either. It's a compromise. But as RAID1/10 will soon
> be unusable due to URE probability during rebuild, I think it's a
> relatively good compromise for some users, some workloads.
An alternative is to move to 3-way raid1 mirrors rather than 2-way
mirrors. Obviously you take another hit in disk space efficiency, but
reads will be faster and you have extra redundancy.
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-21 9:54 ` Adam Goryachev
2013-11-21 10:32 ` David Brown
@ 2013-11-22 8:12 ` Russell Coker
2013-11-25 18:23 ` Pasi Kärkkäinen
1 sibling, 1 reply; 104+ messages in thread
From: Russell Coker @ 2013-11-22 8:12 UTC (permalink / raw)
To: linux-btrfs; +Cc: linux-raid
On Thu, 21 Nov 2013 20:54:54 Adam Goryachev wrote:
> On a pure storage server, the CPU would normally have nothing to do,
> except a little interrupt handling, it is just shuffling bytes around.
> Of course, if you need RAID7.5 then you probably have a dedicated
> storage server, so I don't see the problem with using the CPU to do all
> the calculations.
How would the pure storage server in question send data out to client systems?
SMB? NFS? IMAP? All options will take some CPU power.
There's a common myth that people should use hardware RAID to save CPU time;
that's obviously wrong in the case of RAID-1 (just writing the data twice) and
RAID-5 (XOR for parity). But as we get into the more complex forms of parity,
CPU performance could become an issue.
Of course in terms of overall system performance the cost of doing a RAID
rebuild in terms of disk seeks will probably exceed the CPU use. As an aside,
are there plans to limit the RAID rebuild speed for BTRFS?
On Thu, 21 Nov 2013 18:30:49 Stan Hoeppner wrote:
> I suggest that anyone in the future needing fast random write IOPS is
> going to move those workloads to SSD, which is steadily increasing in
> capacity. And I suggest anyone building arrays with 10-20TB drives
> isn't in need of fast random write IOPS.
Traditionally SCSI/SAS disks have tended to be a lot smaller than IDE/SATA
disks. Now Dell has just started offering 2TB SAS disks while the largest SATA
disks that they sell (In Australia on PowerEdge T110 servers at least) are
also 2TB. Presumably RAID recovery time was one factor that made
manufacturers not bother with making larger SCSI/SAS disks in the past.
When 20TB disks become available a user could choose to just use the first 10TB
of the disk. Even in the days when 40G disks were big some people would use
less than the full capacity of the disk for performance.
Will people really be storing data that is so important that triple parity is
needed on 20TB disks but be unable to afford enough disks to use only the first
10TB of each disk? Is this a use-case that is worth coding for?
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-21 9:07 ` David Brown
2013-11-21 9:54 ` Adam Goryachev
@ 2013-11-22 8:13 ` Stan Hoeppner
2013-11-22 13:15 ` David Brown
` (2 more replies)
2013-11-22 8:38 ` Stan Hoeppner
2 siblings, 3 replies; 104+ messages in thread
From: Stan Hoeppner @ 2013-11-22 8:13 UTC (permalink / raw)
To: David Brown, James Plank, Ric Wheeler
Cc: Andrea Mazzoleni, H. Peter Anvin, linux-raid, linux-btrfs,
David Smith
Hi David,
On 11/21/2013 3:07 AM, David Brown wrote:
> On 21/11/13 02:28, Stan Hoeppner wrote:
...
>> WRT rebuild times, once drives hit 20TB we're looking at 18 hours just
>> to mirror a drive at full streaming bandwidth, assuming 300MB/s
>> average--and that is probably being kind to the drive makers. With 6 or
>> 8 of these drives, I'd guess a typical md/RAID6 rebuild will take at
>> minimum 72 hours or more, probably over 100, and probably more yet for
>> 3P. And with larger drive count arrays the rebuild times approach a
>> week. Whose users can go a week with degraded performance? This is
>> simply unreasonable, at best. I say it's completely unacceptable.
>>
>> With these gargantuan drives coming soon, the probability of multiple
>> UREs during rebuild are pretty high. Continuing to use ever more
>> complex parity RAID schemes simply increases rebuild time further. The
>> longer the rebuild, the more likely a subsequent drive failure due to
>> heat buildup, vibration, etc. Thus, in our maniacal efforts to mitigate
>> one failure mode we're increasing the probability of another. TANSTAFL.
>> Worse yet, RAID10 isn't going to survive because UREs on a single drive
>> are increasingly likely with these larger drives, and one URE during
>> rebuild destroys the array.
> I don't think the chances of hitting an URE during rebuild is dependent
> on the rebuild time - merely on the amount of data read during rebuild.
Please read the above paragraph again, as you misread it the first time.
> URE rates are "per byte read" rather than "per unit time", are they not?
These are specified by the drive manufacturer, and they are per *bits*
read, not "per byte read". Current consumer drives are typically rated
at 1 URE in 10^14 bits read, enterprise are 1 in 10^15.
> I think you are overestimating the rebuild times a bit, but there is no
Which part? A 20TB drive mirror taking 18 hours, or parity arrays
taking many times longer than 18 hours?
> arguing that rebuild on parity raids is a lot more work (for the cpu,
> the IO system, and the disks) than for mirror raids.
It's not so much a matter of work or interface bandwidth, but a matter
of serialization and rotational latency.
...
> Shouldn't we be talking about RAID 15 here, rather than RAID 51 ? I
> interpret "RAID 15" to be like "RAID 10" - a raid5 set of raid1 mirrors,
> while "RAID 51" would be a raid1 mirror of raid5 sets. I am certain
> that you mean a raid5 set of raid1 pairs - I just think you've got the
> name wrong.
Now that you mention it, yes, RAID 15 would fit much better with
convention. Not sure why I thought 51. So it's RAID 15 from here.
>> Potential Advantages:
>>
>> 1. Only +1 disk capacity overhead vs RAID 10, regardless of drive count
>
> +2 disks (the raid5 parity "disk" is a raid1 pair)
One drive of each mirror is already gone. Make a RAID 5 of the
remaining disks and you lose 1 disk. So you lose 1 additional disk vs
RAID 10, not 2. As I stated previously, for RAID 15 you lose half your
disks plus one to redundancy.
...
>> [1] The RAID1/5 code would need to be patched to properly handle a URE
>> encountered by the RAID1 code during rebuild. There are surely other
>> modifications and/or optimizations that would be needed. For large
>> sequential reads, more deterministic read interleaving between mirror
>> pairs would be a good candidate I think. IIUC the RAID1 driver does
>> read interleaving on a per thread basis or some such, which I don't
>> believe is going to work for this "RAID 51" scenario, at least not for
>> single streaming reads. If this can be done well, we double the read
>> performance of RAID5, and thus we don't completely "waste" all the extra
>> disks vs big_parity schemes.
>>
>> This proposed "RAID level 51" should have drastically lower rebuild
>> times vs traditional striped parity, should not suffer read/write
>> performance degradation with most disk failure scenarios, and with a
>> read interleaving optimization may have significantly greater streaming
>> read throughput as well.
>>
>> This is far from a perfect solution and I am certainly not promoting it
>> as such. But I think it does have some serious advantages over
>> traditional striped parity schemes, and at minimum is worth discussion
>> as a counterpoint of sorts.
>
> I don't see that there needs to be any changes to the existing md code
> to make raid15 work - it is merely a raid 5 made from a set of raid1
> pairs.
The sole purpose of the parity layer of the proposed RAID 15 is to
replace sectors lost due to UREs during rebuild. AFAIK the current RAID
5 and RAID 1 drivers have no code to support each other in this manner.
> I can see that improved threading and interleaving could be a
> benefit here - but that's the case in general for md raid, and it is
> something that the developers are already working on (I haven't followed
> the details, but the topic comes up regularly on the list here).
What I'm talking about here is unrelated to the kernel thread starvation
issue, which is write centric, unrelated to reads.
What I'm suggesting is that it might be possible to improve the
concurrency of reads from the mirror disks using some form of static or
adaptive interleaving or similar. The purpose of this would be strictly
to improve large single streaming read performance. If this could be
achieved I do not know.
One possibility may be to count consecutive LBA sectors requested by the
filesystem stream and compare that to some value. For example, say we
have an 18 disk RAID 15 array which gives us 8 spindles. With a default
chunk of 512KB this gives us a stripe width of 4MB. So lets say we
arbitrarily consider any single stream read larger than 4 stripes, 16MB,
to be a large streaming read. So once our stream counter reaches 32,768
sectors we have the mirror code do alternating reads of 1,024 sectors,
512KB (chunk size), from each disk in the mirror.
Theoretically, this could yield large streaming read performance double
that of streaming write, and double that of the current RAID 1 read
behavior on a per mirror basis. The trigger value could be statically
defined at array creation time by a yet to be determined formula based
on spindle count and chunk size, or it could be user configurable.
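As a toy sketch of that trigger (Python pseudocode only; MirrorReadPolicy
and both thresholds are made up, and nothing like this exists in md
today):

STREAM_TRIGGER_SECTORS = 32768   # ~16MB of consecutive 512B sectors (assumed)
CHUNK_SECTORS = 1024             # 512KB chunk (assumed)

class MirrorReadPolicy:
    """Tracks one read stream and decides which mirror leg serves it."""
    def __init__(self):
        self.next_lba = None
        self.run = 0                        # consecutive sectors seen so far

    def pick_leg(self, lba, sectors):
        """Return 0 or 1: the mirror leg that should service this read."""
        if lba == self.next_lba:
            self.run += sectors             # stream continues
        else:
            self.run = sectors              # stream broken, restart counter
        self.next_lba = lba + sectors
        if self.run < STREAM_TRIGGER_SECTORS:
            return 0                        # placeholder for normal balancing
        # Large streaming read: alternate legs chunk by chunk so that both
        # spindles contribute to sequential bandwidth.
        return (lba // CHUNK_SECTORS) % 2

The real decision would have to live in the raid1 read-balancing path and
cope with several interleaved streams, but this is the shape of the
trigger and the alternation.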
> So as far as I can see, you've got raid15 support already - if that's
> what suits your needs, use it. Future improvements to the md code are
> only needed to make it faster.
You're too hung up on names and not getting the point. Whether we call
it RAID 15 or Blue Cheese, if it doesn't have URE mitigation during
rebuild, it's worthless.
> Of course, there is scope for making specific raid15 support in md along
> the lines of the raid10 code - raid15,f2 would have the same speed
> advantages over "normal" raid1+5 as raid10,f2 has over raid1+0. Whether
> it is worth the effort implementing it is a different matter.
Except the RAID10 driver suffers from the single write thread. RAID 0
over mirrors doesn't have this problem. Which is why, along with other
reasons, I proposed a possible RAID 15 driver using the RAID 5 and RAID
1 drivers as the base, as this won't have the single write thread problem.
> I can see plenty of reasons why raid15 might be a good idea, and even
> raid16 for 5 disk redundancy, compared to multi-parity sets. However,
> it costs a lot in disk space.
...
Of course it does, just as RAID 10 does. Parity users who currently
shun RAID 10 for this reason will also shun this "RAID 15". That's
obvious. Potential users of RAID 15 are those who value the features of
RAID 10 other than random write performance.
--
Stan
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-21 9:07 ` David Brown
2013-11-21 9:54 ` Adam Goryachev
2013-11-22 8:13 ` Stan Hoeppner
@ 2013-11-22 8:38 ` Stan Hoeppner
2013-11-22 13:24 ` David Brown
2013-11-22 14:19 ` David Taylor
2 siblings, 2 replies; 104+ messages in thread
From: Stan Hoeppner @ 2013-11-22 8:38 UTC (permalink / raw)
To: David Brown, James Plank, Ric Wheeler
Cc: Andrea Mazzoleni, H. Peter Anvin, linux-raid, linux-btrfs,
David Smith
On 11/21/2013 3:07 AM, David Brown wrote:
> For example, with 20 disks at 1 TB each, you can have:
All correct, and these are maximum redundancies.
Maximum:
> raid5 = 19TB, 1 disk redundancy
> raid6 = 18TB, 2 disk redundancy
> raid6.3 = 17TB, 3 disk redundancy
> raid6.4 = 16TB, 4 disk redundancy
> raid6.5 = 15TB, 5 disk redundancy
These are not fully correct, because only the minimums are stated. With
any mirror based array one can lose half the disks as long as no two are
in one mirror. The probability of a pair failing together is very low,
and this probability decreases even further as the number of drives in
the array increases. This is one of the many reasons RAID 10 has been
so popular for so many years.
Minimum:
> raid10 = 10TB, 1 disk redundancy
> raid15 = 8TB, 3 disk redundancy
> raid16 = 6TB, 5 disk redundancy
Maximum:
RAID 10 = 10 disk redundancy
RAID 15 = 11 disk redundancy
RAID 16 = 12 disk redundancy
Range:
RAID 10 = 1-10 disk redundancy
RAID 15 = 3-11 disk redundancy
RAID 16 = 5-12 disk redundancy
--
Stan
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-21 23:38 ` John Williams
@ 2013-11-22 9:35 ` Stan Hoeppner
2013-11-22 11:24 ` joystick
2013-11-22 15:01 ` John Williams
0 siblings, 2 replies; 104+ messages in thread
From: Stan Hoeppner @ 2013-11-22 9:35 UTC (permalink / raw)
To: John Williams
Cc: James Plank, Ric Wheeler, Andrea Mazzoleni, H. Peter Anvin,
Linux RAID Mailing List, Btrfs BTRFS, David Brown, David Smith
On 11/21/2013 5:38 PM, John Williams wrote:
> On Thu, Nov 21, 2013 at 2:57 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
>> He wrote that article in late 2009. It seems pretty clear he wasn't
>> looking 10 years forward to 20TB drives, where the minimum mirror
>> rebuild time will be ~18 hours, and parity rebuild will be much greater.
>
> Actually, it is completely obvious that he WAS looking ten years
> ahead, seeing as several of his graphs have time scales going to
> 2009+10 = 2019.
Only one graph goes to 2019, the rest are 2010 or less. That being the
case, his 2019 graph deals with projected reliability of single, double,
and triple parity.
> And he specifically mentions longer rebuild times as one of the
> reasons why higher parity RAIDs are needed.
Yes, he certainly does. But *only* in the context of the array
surviving for the duration of a rebuild. He doesn't state that he cares
what the total duration is, he doesn't guess what it might be, nor does
he seem to care about the degraded performance before or during the
rebuild. He is apparently of the mindset "more parity will save us,
until we need more parity, until we need more parity, until we need
more...".
Following this path, parity will eventually eat more disks of capacity
than RAID10 does today for average array counts, and the only reason for
it being survival of ever increasing rebuild duration.
This is precisely why I proposed "RAID 15". It gives you the single
disk cloning rebuild speed of RAID 10. When parity hits 5P then RAID 15
becomes very competitive for smaller arrays. And since drives at that
point will be 40-50TB each, even small arrays will need lots of
protection against UREs and additional failures during massive rebuild
times. Here I'd say RAID 15 will beat 5P hands down.
--
Stan
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-22 9:35 ` Stan Hoeppner
@ 2013-11-22 11:24 ` joystick
2013-11-22 15:01 ` John Williams
1 sibling, 0 replies; 104+ messages in thread
From: joystick @ 2013-11-22 11:24 UTC (permalink / raw)
To: stan, linux-raid
On 22/11/2013 10:35, Stan Hoeppner wrote:
> This is precisely why I proposed "RAID 15". It gives you the single
> disk cloning rebuild speed of RAID 10. When parity hits 5P then RAID 15
> becomes very competitive for smaller arrays.
Fair enough for smaller arrays, and you can already do that because
raid15 needs no additional code in the kernel.
What about big arrays?
Your solution is to double the price, and without calculations I'd bet
it is less resilient than 4P for up to at least 200 disks.
So you propose to buy 100 more disks to save some rebuild time?
Note that the rebuild time depends on:
1: disk capacity vs linear read/write speed, which is the same for the
raid51 and 3+P cases
2: CPU power. But the 500MB/sec claimed by Andrea is already pretty good
and parallelizable on multiple cores. By the time disks reach 8TB, we
will also have many more cores per CPU. Additionally, for a single
failure, probably just the 1st parity would be used for computation, and
in that case the rebuild speed would approach that of raid5, which is
pretty fast.
Your solution is also less extensible. After raid15 you only have
raid16. Additional levels don't follow your philosophy.
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-22 8:13 ` Stan Hoeppner
@ 2013-11-22 13:15 ` David Brown
2013-11-22 16:07 ` Stan Hoeppner
2013-11-22 16:50 ` Mark Knecht
2 siblings, 0 replies; 104+ messages in thread
From: David Brown @ 2013-11-22 13:15 UTC (permalink / raw)
To: stan, James Plank, Ric Wheeler
Cc: Andrea Mazzoleni, H. Peter Anvin, linux-raid, linux-btrfs,
David Smith
On 22/11/13 09:13, Stan Hoeppner wrote:
> Hi David,
>
> On 11/21/2013 3:07 AM, David Brown wrote:
>> On 21/11/13 02:28, Stan Hoeppner wrote:
> ...
>>> WRT rebuild times, once drives hit 20TB we're looking at 18 hours just
>>> to mirror a drive at full streaming bandwidth, assuming 300MB/s
>>> average--and that is probably being kind to the drive makers. With 6 or
>>> 8 of these drives, I'd guess a typical md/RAID6 rebuild will take at
>>> minimum 72 hours or more, probably over 100, and probably more yet for
>>> 3P. And with larger drive count arrays the rebuild times approach a
>>> week. Whose users can go a week with degraded performance? This is
>>> simply unreasonable, at best. I say it's completely unacceptable.
>>>
>>> With these gargantuan drives coming soon, the probability of multiple
>>> UREs during rebuild are pretty high. Continuing to use ever more
>>> complex parity RAID schemes simply increases rebuild time further. The
>>> longer the rebuild, the more likely a subsequent drive failure due to
>>> heat buildup, vibration, etc. Thus, in our maniacal efforts to mitigate
>>> one failure mode we're increasing the probability of another. TANSTAFL.
>>> Worse yet, RAID10 isn't going to survive because UREs on a single drive
>>> are increasingly likely with these larger drives, and one URE during
>>> rebuild destroys the array.
>
>
>> I don't think the chances of hitting an URE during rebuild is dependent
>> on the rebuild time - merely on the amount of data read during rebuild.
>
> Please read the above paragraph again, as you misread it the first time.
Yes, I thought you were saying that URE's were more likely during a
parity raid rebuild than during a mirror raid rebuild, because parity
rebuilds take longer. They will be slightly more likely (due to more
mechanical stress on the drives), but only slightly.
>
>> URE rates are "per byte read" rather than "per unit time", are they not?
>
> These are specified by the drive manufacturer, and they are per *bits*
> read, not "per byte read". Current consumer drives are typically rated
> at 1 URE in 10^14 bits read, enterprise are 1 in 10^15.
"Per bit" or "per byte" makes no difference to the principle.
Just to get some numbers here, if we have a 20 TB drive (which don't yet
exist, AFAIK - 6 TB is the highest I have heard) with a URE rate of 1 in
10^14, that means an average of 1.6 errors per read of the whole disk.
Assuming bit errors are independent (an unwarranted assumption, I know -
but it makes the maths easier!), an URE of 1 in 10^14 gives a chance of
3.3 * 10^-10 of an error in a 4 KB sector - and an 83% chance of getting
at least one incorrect sector read out of 20 TB. Even if enterprise
disks have lower URE rates, I think it is reasonable to worry about
URE's during a raid1 rebuild!
The probability of hitting URE's on two disks at the same spot is, of
course, tiny (given that you've got one URE, the chances of a URE in the
same sector on another disk is 3.3 * 10^-10) - so two disk redundancy
lets you have a disk failure and an URE safely.
In theory, mirror raids are safer here because you only need to worry
about a matching URE on /one/ disk. If you have a parity array with 60
disks, the chances of a matching URE on one of the other disks is 2 *
10^-8 - higher than for mirror raids, but still not a big concern. (Of
course, you have more chance of a complete disk failure provoked by
stresses during rebuilds, but that's another failure mode.)
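Just to make those figures easy to reproduce, here is the rough arithmetic in a
few lines of Python (a sketch only; the 20 TB drive, the 1-in-10^14 URE rate,
independent bit errors and 4 KB sectors are the assumptions above - and whether
"20 TB" is read as decimal or binary moves the last probability between roughly
80% and 83%):

# Back-of-the-envelope arithmetic for the URE figures above.
# Assumptions (taken from the text): a 20 TB drive, an URE rate of 1 per 10^14
# bits read, independent bit errors, 4 KB sectors.
URE_PER_BIT = 1e-14
SECTOR_BITS = 4096 * 8

p_sector = 1 - (1 - URE_PER_BIT) ** SECTOR_BITS   # ~3.3e-10 chance a given 4 KB sector is bad
p_match_59 = 1 - (1 - p_sector) ** 59             # matching URE on any of 59 other disks (~2e-8)

for label, drive_bytes in (("20 TB (decimal)", 20e12), ("20 TiB (binary)", 20 * 2**40)):
    expected = drive_bytes * 8 * URE_PER_BIT      # mean UREs per read of the whole disk
    p_any = 1 - (1 - p_sector) ** (drive_bytes / 4096)   # >=1 unreadable sector per full read
    print(f"{label}: {expected:.2f} expected UREs, P(>=1 bad sector)={p_any:.0%}")

print(f"P(one 4 KB sector)={p_sector:.1e}, P(matching URE on 59 other disks)={p_match_59:.1e}")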
What does all this mean? Single disk redundancy, like 2-way raid1
mirrors, is not going to be good enough for bigger disks unless the
manufacturers can get their URE rates significantly lower. You will
need extra redundancy to be safe. That means raid6 is a minimum, or
3-way mirrors, or stacked raids like raid15. And if you want to cope
with a disk failure, a second disk failure due to the stresses of
rebuilding, /and/ an URE, then triple parity or raid15 is needed.
>
>> I think you are overestimating the rebuild times a bit, but there is no
>
> Which part? A 20TB drive mirror taking 18 hours, or parity arrays
> taking many times longer than 18 hours?
The 18 hours for a 20 TB mirror sounds right - but that it takes 9 times
as long for a rebuild with a parity array sounds too much. But I don't
have any figures as evidence. And of course it varies depending on what
else you are doing with the array at the time - parity array rebuilds
will be affected much more by concurrent access to the array than
mirrored arrays. It's all a balance - if you want cheaper space but
need fewer IOPS and can tolerate slower rebuilds, then parity arrays are
good. If you want fast access then raid 15 looks better.
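For what it's worth, the arithmetic behind the 18-hour figure, and how it scales
if a parity rebuild only manages a fraction of the streaming speed, is trivial
(a sketch only; the 20 TB size and 300 MB/s sustained average are the
assumptions from the quoted paragraph):

# Rough rebuild-time arithmetic behind the "18 hours" figure (assumptions from
# the quoted text: 20 TB drive, ~300 MB/s sustained average throughput).
DRIVE_BYTES = 20e12
THROUGHPUT = 300e6          # bytes/second

mirror_hours = DRIVE_BYTES / THROUGHPUT / 3600
print(f"straight disk-to-disk copy: {mirror_hours:.1f} hours")   # ~18.5 h

# If a parity rebuild only runs at some fraction of streaming speed (concurrent
# load, per-stripe latency, ...), the elapsed time scales inversely:
for fraction in (1.0, 0.5, 0.25):
    print(f"parity rebuild at {fraction:.0%} of streaming speed: "
          f"{mirror_hours / fraction:.1f} hours")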
>
>> arguing that rebuild on parity raids is a lot more work (for the cpu,
>> the IO system, and the disks) than for mirror raids.
>
> It's not so much a matter of work or interface bandwidth, but a matter
> of serialization and rotational latency.
>
> ...
>> Shouldn't we be talking about RAID 15 here, rather than RAID 51 ? I
>> interpret "RAID 15" to be like "RAID 10" - a raid5 set of raid1 mirrors,
>> while "RAID 51" would be a raid1 mirror of raid5 sets. I am certain
>> that you mean a raid5 set of raid1 pairs - I just think you've got the
>> name wrong.
>
> Now that you mention it, yes, RAID 15 would fit much better with
> convention. Not sure why I thought 51. So it's RAID 15 from here.
Maybe you wanted to use the power of alien technology from Area 51 :-)
But I'm glad we agree on the name.
>
>>> Potential Advantages:
>>>
>>> 1. Only +1 disk capacity overhead vs RAID 10, regardless of drive count
>>
>> +2 disks (the raid5 parity "disk" is a raid1 pair)
>
> One drive of each mirror is already gone. Make a RAID 5 of the
> remaining disks and you lose 1 disk. So you lose 1 additional disk vs
> RAID 10, not 2. As I stated previously, for RAID 15 you lose [1/2]+1 of
> your disks to redundancy.
Ah, you meant you lose a disk's worth of capacity in comparison to a
raid10 array with the same number of disks? I meant you have to add 2
disks to your raid10 array in order to keep the same capacity. Both are
correct - it's just a different way of looking at it.
Just to be clear, to store data blocks D0, D1, D2, D3 on different raids
you need:
raid0: 4 disks D0, D1, D2, D3
raid10: 8 disks D0a, D0b, D1a, D1b, D2a, D2b, D3a, D3b
raid5: 5 disks D0, D1, D2, D3, P
raid15: 10 disks D0a, D0b, D1a, D1b, D2a, D2b, D3a, D3b, Pa, Pb
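(Or, as a trivial helper that just encodes the list above - illustrative only,
mirroring every unit for raid10/raid15 and adding one parity unit for
raid5/raid15:)

# Disk counts needed to hold n data blocks under the layouts listed above.
def disks_needed(n_data_blocks: int) -> dict:
    return {
        "raid0":  n_data_blocks,
        "raid10": 2 * n_data_blocks,
        "raid5":  n_data_blocks + 1,
        "raid15": 2 * (n_data_blocks + 1),
    }

print(disks_needed(4))   # {'raid0': 4, 'raid10': 8, 'raid5': 5, 'raid15': 10}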
>
> ...
>>> [1] The RAID1/5 code would need to be patched to properly handle a URE
>>> encountered by the RAID1 code during rebuild. There are surely other
>>> modifications and/or optimizations that would be needed. For large
>>> sequential reads, more deterministic read interleaving between mirror
>>> pairs would be a good candidate I think. IIUC the RAID1 driver does
>>> read interleaving on a per thread basis or some such, which I don't
>>> believe is going to work for this "RAID 51" scenario, at least not for
>>> single streaming reads. If this can be done well, we double the read
>>> performance of RAID5, and thus we don't completely "waste" all the extra
>>> disks vs big_parity schemes.
>>>
>>> This proposed "RAID level 51" should have drastically lower rebuild
>>> times vs traditional striped parity, should not suffer read/write
>>> performance degradation with most disk failure scenarios, and with a
>>> read interleaving optimization may have significantly greater streaming
>>> read throughput as well.
>>>
>>> This is far from a perfect solution and I am certainly not promoting it
>>> as such. But I think it does have some serious advantages over
>>> traditional striped parity schemes, and at minimum is worth discussion
>>> as a counterpoint of sorts.
>>
>> I don't see that there needs to be any changes to the existing md code
>> to make raid15 work - it is merely a raid 5 made from a set of raid1
>> pairs.
>
> The sole purpose of the parity layer of the proposed RAID 15 is to
> replace sectors lost due to UREs during rebuild. AFAIK the current RAID
> 5 and RAID 1 drivers have no code to support each other in this manner.
Now I've figured out what you are thinking about - if the raid1 rebuild
fails on a stripe due to an URE, then rather than just marking that
stripe as bad, it should ask the higher raid level (the raid5 here) for
the data. I don't know if there is any such system in place at the
moment in the kernel code - maybe one of the md code experts here can
tell us. It should be reasonably easy to implement, I think - what is
needed is for a failure on the lower level raid to trigger a scrub on
the stripe at the higher level raid. (Of course, the problem will solve
itself next time you do a scrub on the upper raid anyway, but it would
be best to fix it quickly.)
One other optimisation that could be nice here when rebuilding one of
the mirror pairs is to mark the pair "write-mostly", and possibly even
"write-behind". These flags are currently only valid for raid1 (AFAIK),
and can only be set when building an array. The idea here is that you
can have a mirror between a local fast drive and a slower drive or a
networked drive - the slow drive will only be used for writes unless
there is a failure on the faster drives, and writes can be buffered
(with "write-behind") if needed. If a rebuilding mirror pair in raid 15
could be temporarily marked as "write-mostly" and perhaps
"write-behind", then it would be free to dedicate full bandwidth to the
rebuild. Any reads from that pair would be re-created from the parities
on the other drives.
>
>> I can see that improved threading and interleaving could be a
>> benefit here - but that's the case in general for md raid, and it is
>> something that the developers are already working on (I haven't followed
>> the details, but the topic comes up regularly on the list here).
>
> What I'm talking about here is unrelated to the kernel thread starvation
> issue, which is write centric, unrelated to reads.
>
> What I'm suggesting is that it might be possible to improve the
> concurrency of reads from the mirror disks using some form of static or
> adaptive interleaving or similar. The purpose of this would be strictly
> to improve large single streaming read performance. Whether this could be
> achieved, I do not know.
>
> One possibility may be to count consecutive LBA sectors requested by the
> filesystem stream and compare that to some value. For example, say we
> have an 18 disk RAID 15 array which gives us 8 spindles. With a default
> chunk of 512KB this gives us a stripe width of 4MB. So lets say we
> arbitrarily consider any single stream read larger than 4 stripes, 16MB,
> to be a large streaming read. So once our stream counter reaches 32,768
> sectors we have the mirror code do alternating reads of 1,024 sectors,
> 512KB (chunk size), from each disk in the mirror.
>
I see what you mean, and yes, I can see how that could speed up large
accesses. But I think the same idea applies to a normal raid1 mirror -
a streamed read from it could use interleaved reads from the two halves
to speed up the read. And if it worked for raid1, then it would
automatically work for raid 15. I think :-)
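Purely as an illustration, the kind of decision logic being discussed might
look something like this (a Python sketch, NOT md code; the 32,768-sector
trigger and 1,024-sector slices are just the example numbers from the quoted
paragraph):

# Illustrative sketch of the proposed streaming-read interleaving for a raid1
# pair: treat a reader as "streaming" once it has issued 32,768 consecutive
# sectors (16 MB), then alternate 1,024-sector (chunk-sized, 512 KB) slices
# between the two mirror halves.
STREAM_TRIGGER = 32_768   # consecutive sectors before interleaving kicks in
SLICE = 1_024             # sectors read from one disk before switching

class Raid1ReadBalancer:
    def __init__(self):
        self.expected_lba = None   # next sector expected if the stream continues
        self.run = 0               # length of the current sequential run, in sectors

    def pick_disk(self, lba, sectors, default=0):
        if lba == self.expected_lba:
            self.run += sectors
        else:
            self.run = sectors     # stream broken: start counting again
        self.expected_lba = lba + sectors
        if self.run < STREAM_TRIGGER:
            return default         # small/random I/O: keep the existing policy
        # Streaming: alternate chunk-sized slices between the two mirror halves.
        return (self.run // SLICE) % 2

# Once a sequential read is long enough, requests alternate between the disks
# in chunk-sized slices.
bal = Raid1ReadBalancer()
picks = [bal.pick_disk(lba, 8) for lba in range(0, 40_000 * 8, 8)]
print("requests to disk 0:", picks.count(0), "- to disk 1:", picks.count(1))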
Of course, you could always use raid10,f2 or raid10,o2 under raid5 to
get faster read performance (at the cost of slower writes - just like
normal raid10,f2 vs. raid1 comparisons). Then on your 18-disk "raid105"
array your streamed reads would use 16 spindles.
> Theoretically, this could yield large streaming read performance double
> that of streaming write, and double that of the current RAID 1 read
> behavior on a per mirror basis. The trigger value could be statically
> defined at array creation time by a yet to be determined formula based
> on spindle count and chunk size, or it could be user configurable.
>
>> So as far as I can see, you've got raid15 support already - if that's
>> what suits your needs, use it. Future improvements to the md code are
>> only needed to make it faster.
>
> You're too hung up on names and not getting the point. Whether we call
> it RAID 15 or Blue Cheese, if it doesn't have URE mitigation during
> rebuild, it's worthless.
As noted above, I think the current system would work but you need a
scrub at the raid5 level after the raid1 rebuild - and (now that I see
what you are getting at) I agree that this could be done much better
with extra kernel support.
>
>> Of course, there is scope for making specific raid15 support in md along
>> the lines of the raid10 code - raid15,f2 would have the same speed
>> advantages over "normal" raid1+5 as raid10,f2 has over raid1+0. Whether
>> it is worth the effort implementing it is a different matter.
>
> Except the RAID10 driver suffers from the single write thread. RAID 0
> over mirrors doesn't have this problem. Which is why, along with other
> reasons, I proposed a possible RAID 15 driver using the RAID 5 and RAID
> 1 drivers as the base, as this won't have the single write thread problem.
Improving the threading of raid10 writes should be perfectly possible
technically - the only problem is someone having the time to do it. We
are just looking at different ideas here, to gauge the pros and cons of
alternative raid structures. Ideally we could have /all/ these ideas
implemented, but developer and tester time is the limitation rather than
the technical solution (since it looks like the technical problems of
multi-parity raid have been solved).
>
>> I can see plenty of reasons why raid15 might be a good idea, and even
>> raid16 for 5 disk redundancy, compared to multi-parity sets. However,
>> it costs a lot in disk space.
> ...
>
> Of course it does, just as RAID 10 does. Parity users who currently
> shun RAID 10 for this reason will also shun this "RAID 15". That's
> obvious. Potential users of RAID 15 are those who value the features of
> RAID 10 other than random write performance.
>
Yes, both solutions would be useful.
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-22 8:38 ` Stan Hoeppner
@ 2013-11-22 13:24 ` David Brown
2013-11-28 7:16 ` Stan Hoeppner
2013-11-22 14:19 ` David Taylor
1 sibling, 1 reply; 104+ messages in thread
From: David Brown @ 2013-11-22 13:24 UTC (permalink / raw)
To: stan, James Plank, Ric Wheeler
Cc: Andrea Mazzoleni, H. Peter Anvin, linux-raid, linux-btrfs,
David Smith
On 22/11/13 09:38, Stan Hoeppner wrote:
> On 11/21/2013 3:07 AM, David Brown wrote:
>
>> For example, with 20 disks at 1 TB each, you can have:
>
> All correct, and these are maximum redundancies.
>
> Maximum:
>
>> raid5 = 19TB, 1 disk redundancy
>> raid6 = 18TB, 2 disk redundancy
>> raid6.3 = 17TB, 3 disk redundancy
>> raid6.4 = 16TB, 4 disk redundancy
>> raid6.5 = 15TB, 5 disk redundancy
>
>
> These are not fully correct, because only the minimums are stated. With
> any mirror based array one can lose half the disks as long as no two are
> in one mirror. The probability of a pair failing together is very low,
> and this probability decreases even further as the number of drives in
> the array increases. This is one of the many reasons RAID 10 has been
> so popular for so many years.
>
> Minimum:
>
>> raid10 = 10TB, 1 disk redundancy
>> raid15 = 8TB, 3 disk redundancy
>> raid16 = 6TB, 5 disk redundancy
>
> Maximum:
>
> RAID 10 = 10 disk redundancy
> RAID 15 = 11 disk redundancy
12 disks maximum (you have 8 with data, the rest are mirrors, parity, or
mirrors of parity).
> RAID 16 = 12 disk redundancy
14 disks maximum (you have 6 with data, the rest are mirrors, parity, or
mirrors of parity).
>
> Range:
>
> RAID 10 = 1-10 disk redundancy
> RAID 15 = 3-11 disk redundancy
> RAID 16 = 5-12 disk redundancy
>
>
Yes, I know these are the minimum redundancies. But that's a vital
figure for reliability (even if the range is important for statistical
averages). When one disk in a raid10 array fails, your main concern is
about failures or URE's in the other half of the pair - it doesn't help
to know that another nine disks can "safely" fail too.
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-22 8:38 ` Stan Hoeppner
2013-11-22 13:24 ` David Brown
@ 2013-11-22 14:19 ` David Taylor
1 sibling, 0 replies; 104+ messages in thread
From: David Taylor @ 2013-11-22 14:19 UTC (permalink / raw)
To: Stan Hoeppner; +Cc: linux-raid
On Fri, 22 Nov 2013, Stan Hoeppner wrote:
>
> With
> any mirror based array one can lose half the disks as long as no two are
> in one mirror. The probability of a pair failing together is very low,
> and this probability decreases even further as the number of drives in
> the array increases. This is one of the many reasons RAID 10 has been
> so popular for so many years.
Er, no.
Sure, the probability of "the failed drive's pair" failing decreases
as the number of drives in the array increases IF you assume there will
be exactly one failure.
But adding more drives increases the probability of another drive
failure, and cancels that out.
Quite obviously adding more unrelated drives to an array does not
affect the probability of failure of the pre-existing drives!
(Except perhaps, by over-stressing the PSU and increasing the
chances of all the drives failing)
--
David Taylor
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-22 9:35 ` Stan Hoeppner
2013-11-22 11:24 ` joystick
@ 2013-11-22 15:01 ` John Williams
2013-11-22 22:28 ` Stan Hoeppner
1 sibling, 1 reply; 104+ messages in thread
From: John Williams @ 2013-11-22 15:01 UTC (permalink / raw)
To: stan
Cc: James Plank, Ric Wheeler, Andrea Mazzoleni, H. Peter Anvin,
Linux RAID Mailing List, Btrfs BTRFS, David Brown, David Smith
On Fri, Nov 22, 2013 at 1:35 AM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> Only one graph goes to 2019, the rest are 2010 or less. That being the
> case, his 2019 graph deals with projected reliability of single, double,
> and triple parity.
The whole article goes to 2019 (or longer). He shows current trends
and discusses where they are going in the future. The whole point of
the article is looking ahead into the future.
> Following this path, parity will eventually eat more disks of capacity
> than RAID10 does today for average array counts, and the only reason for
> it being survival of ever increasing rebuild duration.
No, that is not what the article finds. In the near future (about 10
years), triple-parity will suffice. Beyond that, perhaps quad-parity
will be required, but predicting that far ahead is usually worthless
in the computer industry.
> When parity hits 5P then RAID 15
> becomes very competitive for smaller arrays. And since drives at that
> point will be 40-50TB each, even small arrays will need lots of
> protection against UREs and additional failures during massive rebuild
> times. Here I'd say RAID 15 will beat 5P hands down.
I'll take triple- or 4-parity every time over the disk-wasting and
less reliable RAID 15. There is no need for 5-parity in the near
future. I see no advantage of RAID 15, and several disadvantages.
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-22 8:13 ` Stan Hoeppner
2013-11-22 13:15 ` David Brown
@ 2013-11-22 16:07 ` Stan Hoeppner
2013-11-22 22:59 ` NeilBrown
2013-11-22 16:50 ` Mark Knecht
2 siblings, 1 reply; 104+ messages in thread
From: Stan Hoeppner @ 2013-11-22 16:07 UTC (permalink / raw)
To: David Brown, James Plank, Ric Wheeler
Cc: Andrea Mazzoleni, H. Peter Anvin, linux-raid, linux-btrfs,
David Smith
On 11/22/2013 2:13 AM, Stan Hoeppner wrote:
> Hi David,
>
> On 11/21/2013 3:07 AM, David Brown wrote:
...
>> I don't see that there needs to be any changes to the existing md code
>> to make raid15 work - it is merely a raid 5 made from a set of raid1
>> pairs.
>
> The sole purpose of the parity layer of the proposed RAID 15 is to
> replace sectors lost due to UREs during rebuild. AFAIK the current RAID
> 5 and RAID 1 drivers have no code to support each other in this manner.
Minor self correction here-- obviously this isn't the 'sole' purpose of
the parity layer. It also allows us to recover from losing an entire
mirror, which is a big upshot of the proposed RAID 15. Thinking this
through a little further, more code modification would be needed for
this scenario.
In the event of a double drive failure in one mirror, the RAID 1 code
will need to be modified in such a way as to allow the RAID 5 code to
rebuild the first replacement disk, because the RAID 1 device is still
in a failed state. Once this rebuild is complete, the RAID 1 code will
need to switch the state to degraded, and then do its standard rebuild
routine for the 2nd replacement drive.
Or, with some (likely major) hacking it should be possible to rebuild
both drives simultaneously for no loss of throughput or additional
elapsed time on the RAID 5 rebuild. In the 20TB drive case, this would
shave 18 hours off the total rebuild operation elapsed time. With
current 4TB drives it would still save 6.5 hours. Losing both drives in
one mirror set of a striped array is rare, but given the rebuild time
saved it may be worth investigating during any development of this RAID
15 idea.
--
Stan
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-22 8:13 ` Stan Hoeppner
2013-11-22 13:15 ` David Brown
2013-11-22 16:07 ` Stan Hoeppner
@ 2013-11-22 16:50 ` Mark Knecht
2013-11-22 19:51 ` Duncan
2 siblings, 1 reply; 104+ messages in thread
From: Mark Knecht @ 2013-11-22 16:50 UTC (permalink / raw)
To: Stan Hoeppner
Cc: David Brown, James Plank, Ric Wheeler, Andrea Mazzoleni,
H. Peter Anvin, Linux-RAID, linux-btrfs, David Smith
On Fri, Nov 22, 2013 at 12:13 AM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> Hi David,
>
> On 11/21/2013 3:07 AM, David Brown wrote:
<SNIP>
>> Shouldn't we be talking about RAID 15 here, rather than RAID 51 ? I
>> interpret "RAID 15" to be like "RAID 10" - a raid5 set of raid1 mirrors,
>> while "RAID 51" would be a raid1 mirror of raid5 sets. I am certain
>> that you mean a raid5 set of raid1 pairs - I just think you've got the
>> name wrong.
>
> Now that you mention it, yes, RAID 15 would fit much better with
> convention. Not sure why I thought 51. So it's RAID 15 from here.
<SNIP>
For us casual readers & RAID users could you clarify RAID15? Would
that be a bunch of RAID1's grouped together in what appears to be a
RAID5 to the system?
Thanks,
Mark
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-22 16:50 ` Mark Knecht
@ 2013-11-22 19:51 ` Duncan
0 siblings, 0 replies; 104+ messages in thread
From: Duncan @ 2013-11-22 19:51 UTC (permalink / raw)
To: linux-btrfs; +Cc: linux-raid
Mark Knecht posted on Fri, 22 Nov 2013 08:50:32 -0800 as excerpted:
> On Fri, Nov 22, 2013 at 12:13 AM, Stan Hoeppner <stan@hardwarefreak.com>
> wrote:
>> Now that you mention it, yes, RAID 15 would fit much better with
>> convention. Not sure why I thought 51. So it's RAID 15 from here.
> <SNIP>
>
> For us casual readers & RAID users could you clarify RAID15? Would that
> be a bunch of RAID1's grouped together in what appears to be a RAID5 to
> the system?
Simplest definition, yes.
Admittedly part of this discussion is beyond me (as another casual reader
with some raid experience, reading here via the btrfs list as that's my
current interest), but I'm following enough of it to find it interesting,
for SURE! =:^)
And perhaps my explanation of the basics will let the real experts
continue the debate at their higher level...
At a concept level, because md/raid, etc (I'll use mdraid as my example
from here, but there's dm-raid, hardware raid, etc; additionally,
I'll omit the ALL CAPS RAID convention and use lowercase), devices are
presented as normal block devices, RAID levels (among other things, LVM2,
etc) are stackable. So it's possible to, for instance, create a raid0 on
top of a bunch of raid1s, or the reverse, a raid1 on top of a bunch of
raid0s, either with the base level being hardware based and the software
creating a raid level direct on the hardware raid, or with both/all
levels in software.
Then we get into naming. AFAIK the earliest convention was using the
plus syntax, raid1+0, raid0+1, with the left-most number being the
lowest, closest to hardware level, either the hardware level or closest
to the individual hardware devices, so raid1+0 is implemented as striped
raid (raid0) over top of mirrored raid (raid1), with raid0+1 the reverse,
a mirror over stripes.
That quickly evolved into omitting the +, thus raid10 and raid01. (Tho 01
has the leading zero problem with some people trying to omit it, and
raid1 isn't the same thing AT ALL as raid01! Between that and the fact
that raid01 is less common than raid10 for technical reasons as noted
below, you seldom see raid01 specified; it usually keeps the + and
appears as raid0+1).
Also, less commonly seen but as more levels were stacked (raid105, etc),
sometimes the + is still used to separate the hardware raid levels from
software. In this usage, raid105 would probably be an all software
implementation, while raid1+05 would be raid1 in hardware, with software
raid0 and raid5 stacked on top, in that order, and raid10+5 would be
hardware raid10, with software raid5 on top.
Note that while raid10, aka raid1+0, should have similar non-degraded
performance to raid0+1, there's a BIG difference when recovering from
degraded. A smart raid10 implementation (or a raid1+0 with hardware
raid1) can rebuild a failed drive "locally", that is, purely at the raid1
level, using just the data on its raid1 mirror(s). That means only a
single device has to be read from in order to write the data to the
rebuilding device. Raid0+1, by contrast, fails the entire raid0 level at
once, thus requiring reading from an unfailed entire raid1 (higher) level
mirror set while writing out an entire new raid0 set!! So while normal
operation state is similar between raid10/raid1+0 and raid0+1, the
recovery characteristics are **MUCH** different, with raid10 being
markedly better than raid0+1. As a result, raid0+1 doesn't tend to be
used that often in practice, while raid10 (aka raid1+0) has become quite
common, particularly so as its performance is quite high, only exceeded
by raid0, but with redundancy and recovery characteristics that are good
to very good, as well. Its biggest negative at the low end is the number
of devices required, normally a minimum of four (but see the Linux
mdraid10 discussion below), a striped pair of mirrored pairs.
This 1+0/0+1 distinction confused me as an early raid user for quite some
time even after I knew the technical difference, as I kept trying to
reverse them in my head, and I guess it confuses a lot of people. For
some reason, my intuitive read of raid10 was the reverse of convention --
intuitively I /wanted/ to interpret it as a raid1 on top of raid0 instead
of the raid0 on top of raid1 it is by convention, and even after I
understood that there WAS a difference and in principle knew why and how,
for years I actually had to look up the difference each time it came up,
if it made a difference to the discussion, because I /wanted/ to read it
backward, or more accurately, I thought the convention had it backward to
the interpretation that made most sense to me. It is only recently that
I came to see it the other way, and even still, I have to pause and think
every time I see it, to ensure I'm not again reversing things.
Which is the distinction that came up in the above discussion as well,
only with raid5 and raid1 instead of raid0 and raid1. Apparently I'm not
the only one to get things reversed!
But yes, conceptually, raid15 is a raid5 layer on top of raid1, aka raid1
+5, while raid51 would be a raid1 layer on top of raid5, aka raid5+1.
For the same recovery-time reasons noted above with raid0+1 vs. raid1+0/
raid10, having raid1 at the local/hardware layer should be preferable.
With the basic concepts covered, the next level up is understanding that
the Linux md/raid10 implementation, while BASED on the raid10 concept
above, has some quite interesting extensions. Implementing it as a
single software raid10 level instead of separate raid0 over raid1 allows
some interesting optimizations and additional flexibility. Among other
things, it no longer requires a minimum four devices (a raid0 pair of
raid1 pairs) as separate raid0 over raid1 would. There's quite a bit of
additional flexibility in layout. A detailed discussion is out of scope
here, but googling raid10 on wikipedia is a good start, and the page it
gives you actually discusses various other nested raid levels as well.
From there, follow the links to non-standard raid levels, and to the
Linux mdraid implementation discussion, including the concepts of "near",
"far", and "offset" layouts.
https://en.wikipedia.org/wiki/Raid10
https://en.wikipedia.org/wiki/Non-standard_RAID_levels
https://en.wikipedia.org/wiki/Non-standard_RAID_levels#Linux_MD_RAID_10
But the discussion here is well beyond that, out toward further
implementation detail and optimization.
One of the problems that has been creeping up on us is the fact that as
sheer drive sizes increase, the possibility of undetected/uncorrected
physical device errors goes up faster than the technology gets better at
reducing them. For "simple" parity RAID solutions such as raid5, this is
a rather big problem, because at some point, the chances of error during
recovery scuttling the recovery entirely simply get too large to
practically deal with, with recovery time (and thus time to recovery
failure and try again) similarly increasing toward the days and weeks
point. If recovery's going to take days, only for it to fail due to
physical device error forcing another try...
So the discussion is how to mitigate the problem. Multi-way-parity is of
course the primary discussion in this thread, allowing detection and
recovery of single-sector physical device errors via N-way-parity.
But an integrated raid15 solution, similar to mdraid's current raid10, is
another possibility, effectively using the raid1 mirror level to mitigate
sector-level physical device errors, while using the higher raid5 level
to detect them and trigger a re-mirror at the raid1 level below it. But
the only way that can work is if the two conceptually separate raid
levels are integrated at the implementation level, so the raid5 level
parity error detection can tell the raid1 level which of its mirrors is
bad and force a remirroring from the good one(s) to the bad one.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-22 0:32 ` David Brown
@ 2013-11-22 20:32 ` Piergiorgio Sartor
0 siblings, 0 replies; 104+ messages in thread
From: Piergiorgio Sartor @ 2013-11-22 20:32 UTC (permalink / raw)
To: David Brown
Cc: Piergiorgio Sartor, Andrea Mazzoleni, linux-raid, linux-btrfs,
hpa, creamyfish
Hi David,
On Fri, Nov 22, 2013 at 01:32:09AM +0100, David Brown wrote:
> > One typical case is when many errors are
> > found, belonging to the same disk.
> > This case clearly shows the disk is to be
> > replaced or the interface checked...
> > But, again, the user is the master, not the
> > machine... :-)
>
> I don't know what sort of interface you have for the user, but I guess
> that means you'll have to collect a number of failures before showing
> them so that the user can see the correlation on disk number.
as usual in Unix, one piece of software will
collect data into a file, and another one will
analyze that file.
Originally, one idea was even to check, at
stripe level, how many errors are present
(and where). From that, some statistics would
be presented to the user.
This would be integrated into the check tool,
of course.
> >> For most ECC schemes, you know that all your blocks are set
> >> synchronously - so any block that does not fit in, is an error. With
> >> raid, it could also be that a stripe is only partly written - you can
> >
> > Could it be?
> > I would consider this an error.
>
> It could occur as the result of a failure of some sort (kernel crash,
> power failure, temporary disk problem, etc.). More generally, md raid
> doesn't have to be on local physical disks - maybe one of the "disks" is
> an iSCSI drive or something else over a network that could have failures
> or delays. I haven't thought through all cases here - I am just
> throwing them out as possibilities that might cause trouble.
OK, I misunderstood you, I was thinking of
normal operation...
Again, the check can find that issue; it will
report that it cannot work out which block to
correct, but it will tell where the problem is.
Possibly, another tool can then check the FS at
that position.
bye,
--
piergiorgio
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-22 15:01 ` John Williams
@ 2013-11-22 22:28 ` Stan Hoeppner
0 siblings, 0 replies; 104+ messages in thread
From: Stan Hoeppner @ 2013-11-22 22:28 UTC (permalink / raw)
To: John Williams
Cc: James Plank, Ric Wheeler, Andrea Mazzoleni, H. Peter Anvin,
Linux RAID Mailing List, Btrfs BTRFS, David Brown, David Smith
On 11/22/2013 9:01 AM, John Williams wrote:
<snip>
> I see no advantage of RAID 15, and several disadvantages.
Of course not, just as I stated previously.
On 11/22/2013 2:13 AM, Stan Hoeppner wrote:
> Parity users who currently shun RAID 10 for this reason will also
> shun this "RAID 15".
With that I'll thank you for your input from the pure parity
perspective, and end our discussion. Any further exchange would be
pointless.
--
Stan
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-22 16:07 ` Stan Hoeppner
@ 2013-11-22 22:59 ` NeilBrown
2013-11-23 17:39 ` David Brown
0 siblings, 1 reply; 104+ messages in thread
From: NeilBrown @ 2013-11-22 22:59 UTC (permalink / raw)
To: stan
Cc: David Brown, James Plank, Ric Wheeler, Andrea Mazzoleni,
H. Peter Anvin, linux-raid, linux-btrfs, David Smith
[-- Attachment #1: Type: text/plain, Size: 2241 bytes --]
On Fri, 22 Nov 2013 10:07:09 -0600 Stan Hoeppner <stan@hardwarefreak.com>
wrote:
> On 11/22/2013 2:13 AM, Stan Hoeppner wrote:
> > Hi David,
> >
> > On 11/21/2013 3:07 AM, David Brown wrote:
> ...
> >> I don't see that there needs to be any changes to the existing md code
> >> to make raid15 work - it is merely a raid 5 made from a set of raid1
> >> pairs.
> >
> > The sole purpose of the parity layer of the proposed RAID 15 is to
> > replace sectors lost due to UREs during rebuild. AFAIK the current RAID
> > 5 and RAID 1 drivers have no code to support each other in this manner.
>
> Minor self correction here-- obviously this isn't the 'sole' purpose of
> the parity layer. It also allows us to recover from losing an entire
> mirror, which is a big upshot of the proposed RAID 15. Thinking this
> through a little further, more code modification would be needed for
> this scenario.
>
> In the event of a double drive failure in one mirror, the RAID 1 code
> will need to be modified in such a way as to allow the RAID 5 code to
> rebuild the first replacement disk, because the RAID 1 device is still
> in a failed state. Once this rebuild is complete, the RAID 1 code will
> need to switch the state to degraded, and then do its standard rebuild
> routine for the 2nd replacement drive.
>
> Or, with some (likely major) hacking it should be possible to rebuild
> both drives simultaneously for no loss of throughput or additional
> elapsed time on the RAID 5 rebuild.
Nah, that would be minor hacking. Just recreate the RAID1 in a state that is
not-insync, but with automatic-resync disabled.
Then as continuous writes arrive, move the "recovery_cp" variable forward
towards the end of the array. When it reaches the end we can safely mark the
whole array as 'in-sync' and forget about disabling auto-resync.
NeilBrown
In the 20TB drive case, this would
> shave 18 hours off the total rebuild operation elapsed time. With
> current 4TB drives it would still save 6.5 hours. Losing both drives in
> one mirror set of a striped array is rare, but given the rebuild time
> saved it may be worth investigating during any development of this RAID
> 15 idea.
>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-21 22:57 ` Stan Hoeppner
2013-11-21 23:38 ` John Williams
@ 2013-11-22 23:07 ` NeilBrown
2013-11-23 3:46 ` Stan Hoeppner
1 sibling, 1 reply; 104+ messages in thread
From: NeilBrown @ 2013-11-22 23:07 UTC (permalink / raw)
To: stan
Cc: John Williams, James Plank, Ric Wheeler, Andrea Mazzoleni,
H. Peter Anvin, Linux RAID Mailing List, Btrfs BTRFS, David Brown,
David Smith
[-- Attachment #1: Type: text/plain, Size: 1909 bytes --]
On Thu, 21 Nov 2013 16:57:48 -0600 Stan Hoeppner <stan@hardwarefreak.com>
wrote:
> On 11/21/2013 1:05 AM, John Williams wrote:
> > On Wed, Nov 20, 2013 at 10:52 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> >> On 11/20/2013 8:46 PM, John Williams wrote:
> >>> For myself or any machines I managed for work that do not need high
> >>> IOPS, I would definitely choose triple- or quad-parity over RAID 51 or
> >>> similar schemes with arrays of 16 - 32 drives.
> >>
> >> You must see a week long rebuild as acceptable...
> >
> > It would not be a problem if it did take that long, since I would have
> > extra parity units as backup in case of a failure during a rebuild.
> >
> > But of course it would not take that long. Take, for example, a 24 x
> > 3TB triple-parity array (21+3) that has had two drive failures
> > (perhaps the rebuild started with one failure, but there was soon
> > another failure). I would expect the rebuild to take about a day.
>
> You're looking at today. We're discussing tomorrow's needs. Today's
> 6TB 3.5" drives have sustained average throughput of ~175MB/s.
> Tomorrow's 20TB drives will be lucky to do 300MB/s. As I said
> previously, at that rate a straight disk-disk copy of a 20TB drive takes
> 18.6 hours. This is what you get with RAID1/10/51. In the real world,
> rebuilding a failed drive in a 3P array of say 8 of these disks will
> likely take at least 3 times as long, 2 days 6 hours minimum, probably
> more. This may be perfectly acceptable to some, but probably not to all.
Could you explain your logic here? Why do you think rebuilding parity
will take 3 times as long as rebuilding a copy? Can you measure that sort of
difference today?
Presumably when we have 20TB drives we will also have more cores and quite
possibly dedicated co-processors which will make the CPU load less
significant.
NeilBrown
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-22 23:07 ` NeilBrown
@ 2013-11-23 3:46 ` Stan Hoeppner
2013-11-23 5:04 ` NeilBrown
0 siblings, 1 reply; 104+ messages in thread
From: Stan Hoeppner @ 2013-11-23 3:46 UTC (permalink / raw)
To: NeilBrown
Cc: John Williams, James Plank, Ric Wheeler, Andrea Mazzoleni,
H. Peter Anvin, Linux RAID Mailing List, Btrfs BTRFS, David Brown,
David Smith
On 11/22/2013 5:07 PM, NeilBrown wrote:
> On Thu, 21 Nov 2013 16:57:48 -0600 Stan Hoeppner <stan@hardwarefreak.com>
> wrote:
>
>> On 11/21/2013 1:05 AM, John Williams wrote:
>>> On Wed, Nov 20, 2013 at 10:52 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
>>>> On 11/20/2013 8:46 PM, John Williams wrote:
>>>>> For myself or any machines I managed for work that do not need high
>>>>> IOPS, I would definitely choose triple- or quad-parity over RAID 51 or
>>>>> similar schemes with arrays of 16 - 32 drives.
>>>>
>>>> You must see a week long rebuild as acceptable...
>>>
>>> It would not be a problem if it did take that long, since I would have
>>> extra parity units as backup in case of a failure during a rebuild.
>>>
>>> But of course it would not take that long. Take, for example, a 24 x
>>> 3TB triple-parity array (21+3) that has had two drive failures
>>> (perhaps the rebuild started with one failure, but there was soon
>>> another failure). I would expect the rebuild to take about a day.
>>
>> You're looking at today. We're discussing tomorrow's needs. Today's
>> 6TB 3.5" drives have sustained average throughput of ~175MB/s.
>> Tomorrow's 20TB drives will be lucky to do 300MB/s. As I said
>> previously, at that rate a straight disk-disk copy of a 20TB drive takes
>> 18.6 hours. This is what you get with RAID1/10/51. In the real world,
>> rebuilding a failed drive in a 3P array of say 8 of these disks will
>> likely take at least 3 times as long, 2 days 6 hours minimum, probably
>> more. This may be perfectly acceptable to some, but probably not to all.
>
> Could you explain your logic here? Why do you think rebuilding parity
> will take 3 times as long as rebuilding a copy? Can you measure that sort of
> difference today?
I've not performed head-to-head timed rebuild tests of mirror vs parity
RAIDs. I'm making the elapsed guess for parity RAIDs based on posts
here over the past ~3 years, in which many users reported 16-24+ hour
rebuild times for their fairly wide (12-16 1-2TB drive) RAID6 arrays.
This is likely due to their chosen rebuild priority and concurrent user
load during rebuild. Since this seems to be the norm, instead of giving
100% to the rebuild, I thought it prudent to take this into account,
instead of the theoretical minimum rebuild time.
> Presumably when we have 20TB drives we will also have more cores and quite
> possibly dedicated co-processors which will make the CPU load less
> significant.
But (when) will we have the code to fully take advantage of these? It's
nearly 2014 and we still don't have a working threaded write model for
levels 5/6/10, though maybe soon. Multi-core mainstream x86 CPUs have
been around for 8 years now, SMP and ccNUMA systems even longer. So the
need has been there for a while.
I'm strictly making an observation (possibly not fully accurate) here.
I am not casting stones. I'm not a programmer and am thus unable to
contribute code, only ideas and troubleshooting assistance for fellow
users. Ergo I have no right/standing to complain about the rate of
feature progress. I know that everyone hacking md is making the most of
the time they have available. So again, not a complaint, just an
observation.
--
Stan
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-23 3:46 ` Stan Hoeppner
@ 2013-11-23 5:04 ` NeilBrown
2013-11-23 5:34 ` John Williams
0 siblings, 1 reply; 104+ messages in thread
From: NeilBrown @ 2013-11-23 5:04 UTC (permalink / raw)
To: stan
Cc: John Williams, James Plank, Ric Wheeler, Andrea Mazzoleni,
H. Peter Anvin, Linux RAID Mailing List, Btrfs BTRFS, David Brown,
David Smith
[-- Attachment #1: Type: text/plain, Size: 4184 bytes --]
On Fri, 22 Nov 2013 21:46:50 -0600 Stan Hoeppner <stan@hardwarefreak.com>
wrote:
> On 11/22/2013 5:07 PM, NeilBrown wrote:
> > On Thu, 21 Nov 2013 16:57:48 -0600 Stan Hoeppner <stan@hardwarefreak.com>
> > wrote:
> >
> >> On 11/21/2013 1:05 AM, John Williams wrote:
> >>> On Wed, Nov 20, 2013 at 10:52 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> >>>> On 11/20/2013 8:46 PM, John Williams wrote:
> >>>>> For myself or any machines I managed for work that do not need high
> >>>>> IOPS, I would definitely choose triple- or quad-parity over RAID 51 or
> >>>>> similar schemes with arrays of 16 - 32 drives.
> >>>>
> >>>> You must see a week long rebuild as acceptable...
> >>>
> >>> It would not be a problem if it did take that long, since I would have
> >>> extra parity units as backup in case of a failure during a rebuild.
> >>>
> >>> But of course it would not take that long. Take, for example, a 24 x
> >>> 3TB triple-parity array (21+3) that has had two drive failures
> >>> (perhaps the rebuild started with one failure, but there was soon
> >>> another failure). I would expect the rebuild to take about a day.
> >>
> >> You're looking at today. We're discussing tomorrow's needs. Today's
> >> 6TB 3.5" drives have sustained average throughput of ~175MB/s.
> >> Tomorrow's 20TB drives will be lucky to do 300MB/s. As I said
> >> previously, at that rate a straight disk-disk copy of a 20TB drive takes
> >> 18.6 hours. This is what you get with RAID1/10/51. In the real world,
> >> rebuilding a failed drive in a 3P array of say 8 of these disks will
> >> likely take at least 3 times as long, 2 days 6 hours minimum, probably
> >> more. This may be perfectly acceptable to some, but probably not to all.
> >
> > Could you explain your logic here? Why do you think rebuilding parity
> > will take 3 times as long as rebuilding a copy? Can you measure that sort of
> > difference today?
>
> I've not performed head-to-head timed rebuild tests of mirror vs parity
> RAIDs. I'm making the elapsed guess for parity RAIDs based on posts
> here over the past ~3 years, in which many users reported 16-24+ hour
> rebuild times for their fairly wide (12-16 1-2TB drive) RAID6 arrays.
I guess with that many drives you could hit PCI bus throughput limits.
A 16-lane PCIe 4.0 could just about give 100MB/s to each of 16 devices. So
you would really need top-end hardware to keep all of 16 drives busy in a
recovery.
So yes: rebuilding a drive in a 16-drive RAID6+ would be slower than in e.g.
a 20 drive RAID10.
>
> This is likely due to their chosen rebuild priority and concurrent user
> load during rebuild. Since this seems to be the norm, instead of giving
> 100% to the rebuild, I thought it prudent to take this into account,
> instead of the theoretical minimum rebuild time.
>
> > Presumably when we have 20TB drives we will also have more cores and quite
> > possibly dedicated co-processors which will make the CPU load less
> > significant.
>
> But (when) will we have the code to fully take advantage of these? It's
> nearly 2014 and we still don't have a working threaded write model for
> levels 5/6/10, though maybe soon. Multi-core mainstream x86 CPUs have
> been around for 8 years now, SMP and ccNUMA systems even longer. So the
> need has been there for a while.
I think we might have that multi-threading now - not sure exactly what is
enabled by default though.
I think it requires more than "need" - it requires "demand". i.e. people
repeatedly expressing the need. We certainly have had that for a while, but
not a very long while.
>
> I'm strictly making an observation (possibly not fully accurate) here.
> I am not casting stones. I'm not a programmer and am thus unable to
> contribute code, only ideas and troubleshooting assistance for fellow
> users. Ergo I have no right/standing to complain about the rate of
> feature progress. I know that everyone hacking md is making the most of
> the time they have available. So again, not a complaint, just an
> observation.
Understood - and thanks for your observation.
NeilBrown
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-23 5:04 ` NeilBrown
@ 2013-11-23 5:34 ` John Williams
2013-11-23 7:12 ` NeilBrown
0 siblings, 1 reply; 104+ messages in thread
From: John Williams @ 2013-11-23 5:34 UTC (permalink / raw)
To: NeilBrown
Cc: stan, James Plank, Ric Wheeler, Andrea Mazzoleni, H. Peter Anvin,
Linux RAID Mailing List, Btrfs BTRFS, David Brown, David Smith
On Fri, Nov 22, 2013 at 9:04 PM, NeilBrown <neilb@suse.de> wrote:
> I guess with that many drives you could hit PCI bus throughput limits.
>
> A 16-lane PCIe 4.0 could just about give 100MB/s to each of 16 devices. So
> you would really need top-end hardware to keep all of 16 drives busy in a
> recovery.
> So yes: rebuilding a drive in a 16-drive RAID6+ would be slower than in e.g.
> a 20 drive RAID10.
Not really. A single 8x PCIe 2.0 card has 8 x 500MB/s = 4000MB/s of
potential bandwidth. That would be 250MB/s per drive for 16 drives.
But quite a few people running software RAID with many drives have
multiple PCIe cards. For example, in one machine I have three IBM
M1015 cards (which I got for $75/ea) that are 8x PCIe 2.0. That comes
to 3 x 500MB/s x 8 = 12GB/s of IO bandwidth.
Also, your math is wrong. PCIe 3.0 is 985 MB/s per lane. If we assume
PCIe 4.0 would double that, we would have 1970MB/s per lane. So one
lane of the hypothetical PCIe 4.0 would have enough IO bandwidth to
give about 120MB/s to each of 16 drives. A single 8x PCIe 4.0 card
would have 8 times that capability which is more than 15GB/s.
Even a single 8x PCIe 3.0 card has potentially over 7GB/s of bandwidth.
Bottom line is that IO bandwidth is not a problem for a system with
prudently chosen hardware.
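The per-drive numbers are easy to reproduce (a sketch; the per-lane rates below
are the approximate usable figures quoted above, with PCIe 4.0 assumed to be
double PCIe 3.0):

# Rough per-drive bandwidth arithmetic for the figures above.
LANE_MB_S = {"PCIe 2.0": 500, "PCIe 3.0": 985, "PCIe 4.0 (assumed)": 1970}
DRIVES = 16

for gen, per_lane in LANE_MB_S.items():
    for lanes in (1, 8):
        total = per_lane * lanes
        print(f"{gen} x{lanes}: {total} MB/s total, "
              f"{total / DRIVES:.0f} MB/s per drive for {DRIVES} drives")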
More likely is that you would be CPU limited (rather than bus limited)
in a high-parity rebuild where more than one drive failed. But even
that is not likely to be too bad, since Andrea's single-threaded
recovery code can recover two drives at nearly 1GB/s on one of my
machines. I think the code could probably be threaded to achieve a
multiple of that running on multiple cores.
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-23 5:34 ` John Williams
@ 2013-11-23 7:12 ` NeilBrown
2013-11-24 4:03 ` Stan Hoeppner
0 siblings, 1 reply; 104+ messages in thread
From: NeilBrown @ 2013-11-23 7:12 UTC (permalink / raw)
To: John Williams
Cc: stan, James Plank, Ric Wheeler, Andrea Mazzoleni, H. Peter Anvin,
Linux RAID Mailing List, Btrfs BTRFS, David Brown, David Smith
[-- Attachment #1: Type: text/plain, Size: 2225 bytes --]
On Fri, 22 Nov 2013 21:34:41 -0800 John Williams <jwilliams4200@gmail.com>
wrote:
> On Fri, Nov 22, 2013 at 9:04 PM, NeilBrown <neilb@suse.de> wrote:
>
> > I guess with that many drives you could hit PCI bus throughput limits.
> >
> > A 16-lane PCIe 4.0 could just about give 100MB/s to each of 16 devices. So
> > you would really need top-end hardware to keep all of 16 drives busy in a
> > recovery.
> > So yes: rebuilding a drive in a 16-drive RAID6+ would be slower than in e.g.
> > a 20 drive RAID10.
>
> Not really. A single 8x PCIe 2.0 card has 8 x 500MB/s = 4000MB/s of
> potential bandwidth. That would be 250MB/s per drive for 16 drives.
>
> But quite a few people running software RAID with many drives have
> multiple PCIe cards. For example, in one machine I have three IBM
> M1015 cards (which I got for $75/ea) that are 8x PCIe 2.0. That comes
> to 3 x 500MB/s x 8 = 12GB/s of IO bandwidth.
>
> Also, your math is wrong. PCIe 3.0 is 985 MB/s per lane. If we assume
> PCIe 4.0 would double that, we would have 1970MB/s per lane. So one
> lane of the hypothetical PCIe 4.0 would have enough IO bandwidth to
> give about 120MB/s to each of 16 drives. A single 8x PCIe 4.0 card
> would have 8 times that capability which is more than 15GB/s.
It wasn't my math, it was my reading :-(
16-lane PCIe 4.0 is 31 GB/sec so 2GB/sec per drive. I was reading the
"1-lane" number...
>
> Even a single 8x PCIe 3.0 card has potentially over 7GB/s of bandwidth.
>
> Bottom line is that IO bandwidth is not a problem for a system with
> prudently chosen hardware.
>
> More likely is that you would be CPU limited (rather than bus limited)
> in a high-parity rebuild where more than one drive failed. But even
> that is not likely to be too bad, since Andrea's single-threaded
> recovery code can recover two drives at nearly 1GB/s on one of my
> machines. I think the code could probably be threaded to achieve a
> multiple of that running on multiple cores.
Indeed. It seems likely that with modern hardware, the linear write speed
would be the limiting factor for spinning-rust drives.
For SSDs the limit might end up being somewhere else ...
Thanks,
NeilBrown
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-20 22:29 ` Piergiorgio Sartor
@ 2013-11-23 7:55 ` Andrea Mazzoleni
2013-11-23 22:10 ` Piergiorgio Sartor
0 siblings, 1 reply; 104+ messages in thread
From: Andrea Mazzoleni @ 2013-11-23 7:55 UTC (permalink / raw)
To: Piergiorgio Sartor
Cc: Linux RAID Mailing List, Btrfs BTRFS, H. Peter Anvin, David Brown,
David Smith
Hi Piergiorgio,
> How about par2? How does this work?
I checked the matrix they use, and it sometimes contains singular
square submatrices.
It seems that in GF(2^16) these cases are just less common. Maybe they
have simply gone unnoticed.
Anyway, this seems to be an already known problem for PAR2, with a
hypothetical PAR3 fixing it:
http://sourceforge.net/p/parchive/discussion/96282/thread/d3c6597b/
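For anyone who wants to run this kind of check themselves, a minimal sketch
follows. It works over GF(2^8) with the 0x11d polynomial used by the Linux
RAID-6 code (PAR2 itself uses GF(2^16)), and the candidate matrix at the bottom
is purely illustrative - it is not the PAR2 matrix:

# Enumerate every square submatrix of a candidate parity matrix and test it
# for singularity over GF(2^8) with polynomial 0x11d.
from itertools import combinations

def gf_mul(a, b, poly=0x11d):
    """Multiply two elements of GF(2^8) modulo the given polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r

def gf_pow(a, n):
    r = 1
    for _ in range(n):
        r = gf_mul(r, a)
    return r

def gf_det(m):
    """Determinant by cofactor expansion; in characteristic 2, + and - are both XOR."""
    if len(m) == 1:
        return m[0][0]
    det = 0
    for j in range(len(m)):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        det ^= gf_mul(m[0][j], gf_det(minor))
    return det

def singular_submatrices(matrix):
    """Yield (rows, cols) index tuples of every singular square submatrix."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    for k in range(1, min(n_rows, n_cols) + 1):
        for rsel in combinations(range(n_rows), k):
            for csel in combinations(range(n_cols), k):
                sub = [[matrix[r][c] for c in csel] for r in rsel]
                if gf_det(sub) == 0:
                    yield rsel, csel

# Purely illustrative candidate (NOT the PAR2 matrix): three parity rows built
# from powers of 1, 2 and 4 over eight data columns.
candidate = [[gf_pow(g, j) for j in range(8)] for g in (1, 2, 4)]
bad = list(singular_submatrices(candidate))
print(f"singular square submatrices found: {len(bad)}")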
Ciao,
Andrea
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-22 22:59 ` NeilBrown
@ 2013-11-23 17:39 ` David Brown
0 siblings, 0 replies; 104+ messages in thread
From: David Brown @ 2013-11-23 17:39 UTC (permalink / raw)
To: NeilBrown
Cc: stan, James Plank, Ric Wheeler, Andrea Mazzoleni, H. Peter Anvin,
linux-raid, linux-btrfs, David Smith
On 22/11/13 23:59, NeilBrown wrote:
> On Fri, 22 Nov 2013 10:07:09 -0600 Stan Hoeppner <stan@hardwarefreak.com>
> wrote:
>
<snip>
>> In the event of a double drive failure in one mirror, the RAID 1 code
>> will need to be modified in such a way as to allow the RAID 5 code to
>> rebuild the first replacement disk, because the RAID 1 device is still
>> in a failed state. Once this rebuild is complete, the RAID 1 code will
>> need to switch the state to degraded, and then do its standard rebuild
>> routine for the 2nd replacement drive.
>>
>> Or, with some (likely major) hacking it should be possible to rebuild
>> both drives simultaneously for no loss of throughput or additional
>> elapsed time on the RAID 5 rebuild.
>
> Nah, that would be minor hacking. Just recreate the RAID1 in a state that is
> not-insync, but with automatic-resync disabled.
> Then as continuous writes arrive, move the "recovery_cp" variable forward
> towards the end of the array. When it reaches the end we can safely mark the
> whole array as 'in-sync' and forget about diabling auto-resync.
>
> NeilBrown
>
That was my thoughts here. I don't know what state the planned "bitmap
of non-sync regions" feature is in, but if and when it is implemented,
you would just create the replacement raid1 pair without any
synchronisation. Any writes to the pair (such as during a raid5
rebuild) would get written to both disks at the same time.
David
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-23 7:55 ` Andrea Mazzoleni
@ 2013-11-23 22:10 ` Piergiorgio Sartor
2013-11-24 9:39 ` Andrea Mazzoleni
0 siblings, 1 reply; 104+ messages in thread
From: Piergiorgio Sartor @ 2013-11-23 22:10 UTC (permalink / raw)
To: Andrea Mazzoleni
Cc: Piergiorgio Sartor, Linux RAID Mailing List, Btrfs BTRFS,
H. Peter Anvin, David Brown, David Smith
Hi Andrea,
On Sat, Nov 23, 2013 at 08:55:08AM +0100, Andrea Mazzoleni wrote:
> Hi Piergiorgio,
>
> > How about par2? How does this work?
> I checked the matrix they use, and sometimes it contains some singular
> square submatrix.
> It seems that in GF(2^16) these cases are just less common. Maybe they
> were just unnoticed.
>
> Anyway, this seems to be an already known problem for PAR2, with an
> hypothetical PAR3 fixing it:
>
> http://sourceforge.net/p/parchive/discussion/96282/thread/d3c6597b/
you did some pretty damn good research work!
Maybe you should consider contacting them too.
I'm not sure whether your approach can be extended
to GF(2^16) - I guess it can - and in that case
they might be interested too.
bye,
--
piergiorgio
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-23 7:12 ` NeilBrown
@ 2013-11-24 4:03 ` Stan Hoeppner
2013-11-24 5:14 ` John Williams
2013-11-24 5:19 ` Russell Coker
0 siblings, 2 replies; 104+ messages in thread
From: Stan Hoeppner @ 2013-11-24 4:03 UTC (permalink / raw)
To: NeilBrown, John Williams
Cc: James Plank, Ric Wheeler, Andrea Mazzoleni, H. Peter Anvin,
Linux RAID Mailing List, Btrfs BTRFS, David Brown, David Smith
On 11/23/2013 1:12 AM, NeilBrown wrote:
> On Fri, 22 Nov 2013 21:34:41 -0800 John Williams <jwilliams4200@gmail.com>
>> Even a single 8x PCIe 3.0 card has potentially over 7GB/s of bandwidth.
>>
>> Bottom line is that IO bandwidth is not a problem for a system with
>> prudently chosen hardware.
Quite right.
>> More likely is that you would be CPU limited (rather than bus limited)
>> in a high-parity rebuild where more than one drive failed. But even
>> that is not likely to be too bad, since Andrea's single-threaded
>> recovery code can recover two drives at nearly 1GB/s on one of my
>> machines. I think the code could probably be threaded to achieve a
>> multiple of that running on multiple cores.
>
> Indeed. It seems likely that with modern hardware, the linear write speed
> would be the limiting factor for spinning-rust drives.
Parity array rebuilds are read-modify-write operations. The main
difference from normal operation RMWs is that the write is always to the
same disk. As long as the stripe reads and chunk reconstruction outrun
the write throughput then the rebuild speed should be as fast as a
mirror rebuild. But this doesn't appear to be what people are
experiencing. Parity rebuilds would seem to take much longer.
I have always surmised that the culprit is rotational latency, because
we're not able to get a real sector-by-sector streaming read from each
drive. If even only one disk in the array has to wait for the platter
to come round again, the entire stripe read is slowed down by an
additional few milliseconds. For example, in an 8 drive array let's say
each stripe read is slowed 5ms by only one of the 7 drives due to
rotational latency, maybe acoustical management, or some other firmware
hiccup in the drive. This slows down the entire stripe read because we
can't do parity reconstruction until all chunks are in. An 8x 2TB array
with 512KB chunk has 4 million stripes of 4MB each. Reading 4M stripes,
that extra 5ms per stripe read costs us
(4,000,000 * 0.005)/3600 = 5.56 hours
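(The same arithmetic in a couple of lines, for anyone who wants to play with
the assumptions - the 5 ms penalty per stripe and the 8 x 2TB / 512KB-chunk
geometry are taken from the paragraph above:)

# Extra elapsed time from a fixed per-stripe latency penalty.
DRIVE_BYTES = 2e12
CHUNK_BYTES = 512 * 1024
EXTRA_LATENCY_S = 0.005

stripes = DRIVE_BYTES / CHUNK_BYTES          # ~3.8 million (rounded to 4 million above)
extra_hours = stripes * EXTRA_LATENCY_S / 3600
print(f"{stripes:.2e} stripes -> ~{extra_hours:.1f} extra hours")   # ~5.3 h (5.56 h with 4M stripes)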
Now consider that arrays typically have a few years on them before the
first drive failure. During our rebuild it's likely that some drives
will take a few rotations to return a sector that's marginal. So this
might slow down a stripe read by dozens of milliseconds, maybe a full
second. If this happens to multiple drives many times throughout the
rebuild it will add even more elapsed time, possibly additional hours.
Reading stripes asynchronously or in parallel, which I assume we already
do to some degree, can mitigate these latencies to some extent. But I
think in the overall picture, things of this nature are what is driving
parity rebuilds to dozens of hours for many people. And as I stated
previously, when drives reach 10-20TB, this becomes far worse because
we're reading 2-10x as many stripes. And the more drives per array the
greater the odds of incurring latency during a stripe read.
With a mirror reconstruction we can stream the reads. Though we can't
avoid all of the drive issues above, the total number of hiccups causing
latency will be at most 1/7th those of the parity 8 drive array case.
--
Stan
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-24 4:03 ` Stan Hoeppner
@ 2013-11-24 5:14 ` John Williams
2013-11-24 21:13 ` Stan Hoeppner
2013-11-24 5:19 ` Russell Coker
1 sibling, 1 reply; 104+ messages in thread
From: John Williams @ 2013-11-24 5:14 UTC (permalink / raw)
To: stan
Cc: NeilBrown, James Plank, Ric Wheeler, Andrea Mazzoleni,
H. Peter Anvin, Linux RAID Mailing List, Btrfs BTRFS, David Brown,
David Smith
On Sat, Nov 23, 2013 at 8:03 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> Parity array rebuilds are read-modify-write operations. The main
> difference from normal operation RMWs is that the write is always to the
> same disk. As long as the stripe reads and chunk reconstruction outrun
> the write throughput then the rebuild speed should be as fast as a
> mirror rebuild. But this doesn't appear to be what people are
> experiencing. Parity rebuilds would seem to take much longer.
"This" doesn't appear to be what SOME people, who have reported
issues, are experiencing. Their issues must be examined on a case by
case basis.
But I, and a number of other people I have talked to or corresponded
with, have had mdadm RAID 5 or RAID 6 rebuilds of one drive run at
approximately the optimal sequential write speed of the replacement
drive. It is not unusual on a reasonably configured system.
I don't know how fast the rebuilds go on the experimental RAID 5 or
RAID 6 for btrfs.
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-24 4:03 ` Stan Hoeppner
2013-11-24 5:14 ` John Williams
@ 2013-11-24 5:19 ` Russell Coker
2013-11-24 21:44 ` Stan Hoeppner
1 sibling, 1 reply; 104+ messages in thread
From: Russell Coker @ 2013-11-24 5:19 UTC (permalink / raw)
To: stan; +Cc: Linux RAID Mailing List, Btrfs BTRFS
On Sun, 24 Nov 2013, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> I have always surmised that the culprit is rotational latency, because
> we're not able to get a real sector-by-sector streaming read from each
> drive. If even only one disk in the array has to wait for the platter
> to come round again, the entire stripe read is slowed down by an
> additional few milliseconds. For example, in an 8 drive array let's say
> each stripe read is slowed 5ms by only one of the 7 drives due to
> rotational latency, maybe acoustical management, or some other firmware
> hiccup in the drive. This slows down the entire stripe read because we
> can't do parity reconstruction until all chunks are in. An 8x 2TB array
> with 512KB chunk has 4 million stripes of 4MB each. Reading 4M stripes,
> that extra 5ms per stripe read costs us
>
> (4,000,000 * 0.005)/3600 = 5.56 hours
If that is the problem then the solution would be to just enable read-ahead.
Don't we already have that in both the OS and the disk hardware? The hard-
drive read-ahead buffer should at least cover the case where a seek completes
but the desired sector isn't under the heads.
RAM size is steadily increasing, it seems that the smallest that you can get
nowadays is 1G in a phone and for a server the smallest is probably 4G.
On the smallest system that might have an 8 disk array you should be able to
use 512M for buffers which allows a read-ahead of 128 chunks.
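One way to arrive at that 128-chunk figure (my own arithmetic, assuming the
8-drive, 512KB-chunk array used earlier in the thread):

  # 512MB of read-ahead buffer spread across 8 member disks.
  buffer_bytes = 512 * 1024**2
  drives = 8
  chunk_bytes = 512 * 1024

  chunks_per_drive = buffer_bytes // drives // chunk_bytes
  print(chunks_per_drive)          # 128 chunks (64MB) of read-ahead per disk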
> Now consider that arrays typically have a few years on them before the
> first drive failure. During our rebuild it's likely that some drives
> will take a few rotations to return a sector that's marginal.
Are you suggesting that it would be a common case that people just write data
to an array and never read it or do an array scrub? I hope that it will
become standard practice to have a cron job scrubbing all filesystems.
> So this
> might slow down a stripe read by dozens of milliseconds, maybe a full
> second. If this happens to multiple drives many times throughout the
> rebuild it will add even more elapsed time, possibly additional hours.
Have you observed such 1 second reads in practice?
One thing I've considered doing is placing a cheap disk on a speaker cone to
test vibration induced performance problems. Then I can use a PC to control
the level of vibration in a reasonably repeatable manner. I'd like to see
what the limits are for retries.
Some years ago a company I worked for had some vibration problems which
dropped the contiguous read speed from about 100MB/s to about 40MB/s on some
parts of the disk (other parts gave full performance). That was a serious and
unusual problem and it only about halved the overall speed.
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-23 22:10 ` Piergiorgio Sartor
@ 2013-11-24 9:39 ` Andrea Mazzoleni
0 siblings, 0 replies; 104+ messages in thread
From: Andrea Mazzoleni @ 2013-11-24 9:39 UTC (permalink / raw)
To: Piergiorgio Sartor
Cc: Linux RAID Mailing List, Btrfs BTRFS, H. Peter Anvin, David Brown,
David Smith
Hi Piergiorgio,
Looking into it further, I found that a Cauchy matrix was also evaluated
for PAR3, but they preferred to use Reed-Solomon with an FFT in GF(2^16 + 1).
http://sourceforge.net/mailarchive/forum.php?forum_name=parchive-devel&max_rows=25&style=nested&viewmonth=201006
Note that using a Cauchy matrix to build an MDS code is well known.
What I did was simply adapt a Cauchy matrix to be compatible with the
present Linux kernel implementation, so I think it is of value mainly for
Linux or compatible implementations.
Ciao,
Andrea
On Sat, Nov 23, 2013 at 11:10 PM, Piergiorgio Sartor
<piergiorgio.sartor@nexgo.de> wrote:
> Hi Andrea,
>
> On Sat, Nov 23, 2013 at 08:55:08AM +0100, Andrea Mazzoleni wrote:
>> Hi Piergiorgio,
>>
>> > How about par2? How does this work?
>> I checked the matrix they use, and sometimes it contains some singular
>> square submatrix.
>> It seems that in GF(2^16) these cases are just less common. Maybe they
>> were just unnoticed.
>>
>> Anyway, this seems to be an already known problem for PAR2, with an
>> hypothetical PAR3 fixing it:
>>
>> http://sourceforge.net/p/parchive/discussion/96282/thread/d3c6597b/
>
> you did some pretty damn good research work!
>
> Maybe you should consider contacting them too.
> I'm not sure if your approach can be extended
> to GF(2^16); I guess it can, and in that case they
> might be interested too.
>
> bye,
>
> --
>
> piergiorgio
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-24 5:14 ` John Williams
@ 2013-11-24 21:13 ` Stan Hoeppner
2013-11-24 23:28 ` Rudy Zijlstra
` (2 more replies)
0 siblings, 3 replies; 104+ messages in thread
From: Stan Hoeppner @ 2013-11-24 21:13 UTC (permalink / raw)
To: John Williams
Cc: NeilBrown, James Plank, Ric Wheeler, Andrea Mazzoleni,
H. Peter Anvin, Linux RAID Mailing List, Btrfs BTRFS, David Brown,
David Smith
On 11/23/2013 11:14 PM, John Williams wrote:
> On Sat, Nov 23, 2013 at 8:03 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
>
>> Parity array rebuilds are read-modify-write operations. The main
>> difference from normal operation RMWs is that the write is always to the
>> same disk. As long as the stripe reads and chunk reconstruction outrun
>> the write throughput then the rebuild speed should be as fast as a
>> mirror rebuild. But this doesn't appear to be what people are
>> experiencing. Parity rebuilds would seem to take much longer.
>
> "This" doesn't appear to be what SOME people, who have reported
> issues, are experiencing. Their issues must be examined on a case by
> case basis.
Given what you state below this may very well be the case.
> But I, and a number of other people I have talked to or corresponded
> with, have had mdadm RAID 5 or RAID 6 rebuilds of one drive run at
> approximately the optimal sequential write speed of the replacement
> drive. It is not unusual on a reasonably configured system.
I freely admit I may have drawn an incorrect conclusion about md parity
rebuild performance based on incomplete data. I simply don't recall
anyone stating here in ~3 years that their parity rebuilds were speedy,
but quite the opposite. I guess it's possible that each one of those
cases was due to another factor, such as user load, slow CPU, bus
bottleneck, wonky disk firmware, backplane issues, etc.
--
Stan
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-24 5:19 ` Russell Coker
@ 2013-11-24 21:44 ` Stan Hoeppner
2013-11-24 22:31 ` Mark Knecht
2013-11-25 2:14 ` Russell Coker
0 siblings, 2 replies; 104+ messages in thread
From: Stan Hoeppner @ 2013-11-24 21:44 UTC (permalink / raw)
To: russell; +Cc: Linux RAID Mailing List, Btrfs BTRFS
On 11/23/2013 11:19 PM, Russell Coker wrote:
> On Sun, 24 Nov 2013, Stan Hoeppner <stan@hardwarefreak.com> wrote:
>> I have always surmised that the culprit is rotational latency, because
>> we're not able to get a real sector-by-sector streaming read from each
>> drive. If even only one disk in the array has to wait for the platter
>> to come round again, the entire stripe read is slowed down by an
>> additional few milliseconds. For example, in an 8 drive array let's say
>> each stripe read is slowed 5ms by only one of the 7 drives due to
>> rotational latency, maybe acoustical management, or some other firmware
>> hiccup in the drive. This slows down the entire stripe read because we
>> can't do parity reconstruction until all chunks are in. An 8x 2TB array
>> with 512KB chunk has 4 million stripes of 4MB each. Reading 4M stripes,
>> that extra 5ms per stripe read costs us
>>
>> (4,000,000 * 0.005)/3600 = 5.56 hours
>
> If that is the problem then the solution would be to just enable read-ahead.
> Don't we already have that in both the OS and the disk hardware? The hard-
> drive read-ahead buffer should at least cover the case where a seek completes
> but the desired sector isn't under the heads.
I'm not sure if read-ahead would solve such a problem, if indeed this is
a possible problem. AFAIK the RAID5/6 drivers process stripes serially,
not asynchronously, so I'd think the rebuild may still stall for ms at a
time in such a situation.
> RAM size is steadily increasing, it seems that the smallest that you can get
> nowadays is 1G in a phone and for a server the smallest is probably 4G.
>
> On the smallest system that might have an 8 disk array you should be able to
> use 512M for buffers which allows a read-ahead of 128 chunks.
>
>> Now consider that arrays typically have a few years on them before the
>> first drive failure. During our rebuild it's likely that some drives
>> will take a few rotations to return a sector that's marginal.
>
> Are you suggesting that it would be a common case that people just write data
> to an array and never read it or do an array scrub? I hope that it will
> become standard practice to have a cron job scrubbing all filesystems.
Given the frequency of RAID5 double drive failure "save me!" help
requests we see on a very regular basis here, it seems pretty clear this
is exactly what many users do.
>> So this
>> might slow down a stripe read by dozens of milliseconds, maybe a full
>> second. If this happens to multiple drives many times throughout the
>> rebuild it will add even more elapsed time, possibly additional hours.
>
> Have you observed such 1 second reads in practice?
We seem to have regular reports from DIY hardware users intentionally
using mismatched consumer drives, as many believe this gives them
additional protection against a firmware bug in a given drive model.
But then they often see multiple second timeouts causing drives to be
kicked, or performance to be slow, because of the mismatched drives.
In my time on this list, it seems pretty clear that the vast majority of
posters use DIY hardware, not matched, packaged, tested solutions from
the likes of Dell, HP, IBM, etc. Some of the things I've speculated
about in my last few posts could very well occur, and indeed be caused
by ad hoc component selection and system assembly. Obviously not in
all DIY cases, but probably many.
--
Stan
> One thing I've considered doing is placing a cheap disk on a speaker cone to
> test vibration induced performance problems. Then I can use a PC to control
> the level of vibration in a reasonably repeatable manner. I'd like to see
> what the limits are for retries.
>
> Some years ago a company I worked for had some vibration problems which
> dropped the contiguous read speed from about 100MB/s to about 40MB/s on some
> parts of the disk (other parts gave full performance). That was a serious and
> unusual problem and it only about halved the overall speed.
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-24 21:44 ` Stan Hoeppner
@ 2013-11-24 22:31 ` Mark Knecht
2013-11-25 2:14 ` Russell Coker
1 sibling, 0 replies; 104+ messages in thread
From: Mark Knecht @ 2013-11-24 22:31 UTC (permalink / raw)
To: Stan Hoeppner; +Cc: russell, Linux RAID Mailing List, Btrfs BTRFS
On Sun, Nov 24, 2013 at 1:44 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
<SNIP>
>> Are you suggesting that it would be a common case that people just write data
>> to an array and never read it or do an array scrub? I hope that it will
>> become standard practice to have a cron job scrubbing all filesystems.
>
> Given the frequency of RAID5 double drive failure "save me!" help
> requests we see on a very regular basis here, it seems pretty clear this
> is exactly what many users do.
Thank you for reminding me to set this up. Sunday afternoons at 2:30
is now my scrub time.
Cheers,
Mark
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-24 21:13 ` Stan Hoeppner
@ 2013-11-24 23:28 ` Rudy Zijlstra
2013-11-24 23:53 ` Alex Elsayed
2013-11-25 9:15 ` David Brown
2 siblings, 0 replies; 104+ messages in thread
From: Rudy Zijlstra @ 2013-11-24 23:28 UTC (permalink / raw)
To: stan
Cc: John Williams, NeilBrown, James Plank, Ric Wheeler,
Andrea Mazzoleni, H. Peter Anvin, Linux RAID Mailing List,
Btrfs BTRFS, David Brown, David Smith
On 24-11-13 22:13, Stan Hoeppner wrote:
> I freely admit I may have drawn an incorrect conclusion about md
> parity rebuild performance based on incomplete data. I simply don't
> recall anyone stating here in ~3 years that their parity rebuilds were
> speedy, but quite the opposite. I guess it's possible that each one of
> those cases was due to another factor, such as user load, slow CPU,
> bus bottleneck, wonky disk firmware, backplane issues, etc.
The other part of this is that the list sees what goes wrong. The cases
where a replacement goes smoothly do not reach the list.
Cheers,
Rudy
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-24 21:13 ` Stan Hoeppner
2013-11-24 23:28 ` Rudy Zijlstra
@ 2013-11-24 23:53 ` Alex Elsayed
2013-11-25 2:04 ` Stan Hoeppner
2013-11-25 9:15 ` David Brown
2 siblings, 1 reply; 104+ messages in thread
From: Alex Elsayed @ 2013-11-24 23:53 UTC (permalink / raw)
To: linux-raid; +Cc: linux-btrfs
Stan Hoeppner wrote:
> On 11/23/2013 11:14 PM, John Williams wrote:
>> On Sat, Nov 23, 2013 at 8:03 PM, Stan Hoeppner <stan@hardwarefreak.com>
>> wrote:
<snip>
>
>> But I, and a number of other people I have talked to or corresponded
>> with, have had mdadm RAID 5 or RAID 6 rebuilds of one drive run at
>> approximately the optimal sequential write speed of the replacement
>> drive. It is not unusual on a reasonably configured system.
>
> I freely admit I may have drawn an incorrect conclusion about md parity
> rebuild performance based on incomplete data. I simply don't recall
> anyone stating here in ~3 years that their parity rebuilds were speedy,
> but quite the opposite. I guess it's possible that each one of those
> cases was due to another factor, such as user load, slow CPU, bus
> bottleneck, wonky disk firmware, backplane issues, etc.
>
Well, there's also the issue of selection bias - people come to the list and
complain when their RAID is taking forever to resync. People generally don't
come to the list and complain when their RAID resyncs quickly and without
issues.
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-24 23:53 ` Alex Elsayed
@ 2013-11-25 2:04 ` Stan Hoeppner
2013-11-25 4:48 ` Alex Elsayed
0 siblings, 1 reply; 104+ messages in thread
From: Stan Hoeppner @ 2013-11-25 2:04 UTC (permalink / raw)
To: Alex Elsayed, linux-raid; +Cc: linux-btrfs
On 11/24/2013 5:53 PM, Alex Elsayed wrote:
> Stan Hoeppner wrote:
>
>> On 11/23/2013 11:14 PM, John Williams wrote:
>>> On Sat, Nov 23, 2013 at 8:03 PM, Stan Hoeppner <stan@hardwarefreak.com>
>>> wrote:
> <snip>
>>
>>> But I, and a number of other people I have talked to or corresponded
>>> with, have had mdadm RAID 5 or RAID 6 rebuilds of one drive run at
>>> approximately the optimal sequential write speed of the replacement
>>> drive. It is not unusual on a reasonably configured system.
>>
>> I freely admit I may have drawn an incorrect conclusion about md parity
>> rebuild performance based on incomplete data. I simply don't recall
>> anyone stating here in ~3 years that their parity rebuilds were speedy,
>> but quite the opposite. I guess it's possible that each one of those
>> cases was due to another factor, such as user load, slow CPU, bus
>> bottleneck, wonky disk firmware, backplane issues, etc.
>>
>
> Well, there's also the issue of selection bias - people come to the list and
"Selection bias" would infer I'm doing some kind of formal analysis,
which is obviously not the case, though I do understand the point you're
making.
> complain when their RAID is taking forever to resync. People generally don't
> come to the list and complain when their RAID resyncs quickly and without
> issues.
When folks report problems on linux-raid it is commonplace for others to
reply that the same feature works fine for them, that the problem may be
configuration specific, etc. When people have reported slow RAID5/6
rebuilds in the past, and these were not always reported in direct help
requests but as "me too" posts, I don't recall others saying their
parity rebuilds are speedy. I'm not saying nobody ever has, simply that
I don't recall such. Which is why I've been under the impression that
parity rebuilds are generally slow for everyone.
I wish I had hardware available to perform relevant testing. It would
be nice to have some real data on this showing apples to apples rebuild
times for the various RAID levels on the same hardware.
--
Stan
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-24 21:44 ` Stan Hoeppner
2013-11-24 22:31 ` Mark Knecht
@ 2013-11-25 2:14 ` Russell Coker
2013-11-25 9:20 ` David Brown
1 sibling, 1 reply; 104+ messages in thread
From: Russell Coker @ 2013-11-25 2:14 UTC (permalink / raw)
To: stan; +Cc: Linux RAID Mailing List, Btrfs BTRFS
On Mon, 25 Nov 2013, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> > If that is the problem then the solution would be to just enable
> > read-ahead. Don't we already have that in both the OS and the disk
> > hardware? The hard- drive read-ahead buffer should at least cover the
> > case where a seek completes but the desired sector isn't under the
> > heads.
>
> I'm not sure if read-ahead would solve such a problem, if indeed this is
> a possible problem. AFAIK the RAID5/6 drivers process stripes serially,
> not asynchronously, so I'd think the rebuild may still stall for ms at a
> time in such a situation.
For a RAID block device (such as Linux software RAID) read-ahead should work
well. For a RAID type configuration managed by the filesystem where you might
have different RAID levels in the same filesystem it might not be possible.
It would be a nice feature to have RAID-0 for unimportant files and RAID-1 or
RAID-6 for important files on the same filesystem. But that type of thing
would really complicate RAID rebuild.
> >> So this
> >> might slow down a stripe read by dozens of milliseconds, maybe a full
> >> second. If this happens to multiple drives many times throughout the
> >> rebuild it will add even more elapsed time, possibly additional hours.
> >
> > Have you observed such 1 second reads in practice?
>
> We seem to have regular reports from DIY hardware users intentionally
> using mismatched consumer drives, as many believe this gives them
> additional protection against a firmware bug in a given drive model.
> But then they often see multiple second timeouts causing drives to be
> kicked, or performance to be slow, because of the mismatched drives.
I've had that happen in my own RAID-1 and had reports of the same from other
people. My problem was that one of the disks was a WD "Green" drive that
aggressively parked the heads. When I replaced the array, the disk in question
had parked its heads about a million times more than it was supposed to, but
it still worked.
I think that if you want to buy two different disks from your local discount
store then you are likely to end up with one that's not well suited to server
use because they often don't have more than two types of disks on offer.
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-25 2:04 ` Stan Hoeppner
@ 2013-11-25 4:48 ` Alex Elsayed
0 siblings, 0 replies; 104+ messages in thread
From: Alex Elsayed @ 2013-11-25 4:48 UTC (permalink / raw)
To: linux-raid; +Cc: linux-btrfs
Stan Hoeppner wrote:
<snip>
> "Selection bias" would imply I'm doing some kind of formal analysis,
> which is obviously not the case, though I do understand the point you're
> making.
Nope - the only thing 'selection bias' says is that inferences drawn from a
non-representative dataset will also be non-representative.
Drawing an inference of 'raid rebuilds are slow' from a mailing list people
come to when they need to report an issue will overemphasize problems and
deemphasize cases where everything is working properly.
Also, no matter how quickly a rebuild goes, it's still longer than people
would like. No matter how quickly it finishes, you're still in a stressed
state where something has _already_ gone wrong and your parachute is now in
use and unable to save you a second time.
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-24 21:13 ` Stan Hoeppner
2013-11-24 23:28 ` Rudy Zijlstra
2013-11-24 23:53 ` Alex Elsayed
@ 2013-11-25 9:15 ` David Brown
2 siblings, 0 replies; 104+ messages in thread
From: David Brown @ 2013-11-25 9:15 UTC (permalink / raw)
To: stan, John Williams
Cc: NeilBrown, James Plank, Ric Wheeler, Andrea Mazzoleni,
H. Peter Anvin, Linux RAID Mailing List, Btrfs BTRFS, David Smith
On 24/11/13 22:13, Stan Hoeppner wrote:
> On 11/23/2013 11:14 PM, John Williams wrote:
>> On Sat, Nov 23, 2013 at 8:03 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
>>
>>> Parity array rebuilds are read-modify-write operations. The main
>>> difference from normal operation RMWs is that the write is always to the
>>> same disk. As long as the stripe reads and chunk reconstruction outrun
>>> the write throughput then the rebuild speed should be as fast as a
>>> mirror rebuild. But this doesn't appear to be what people are
>>> experiencing. Parity rebuilds would seem to take much longer.
>>
>> "This" doesn't appear to be what SOME people, who have reported
>> issues, are experiencing. Their issues must be examined on a case by
>> case basis.
>
> Given what you state below this may very well be the case.
>
>> But I, and a number of other people I have talked to or corresponded
>> with, have had mdadm RAID 5 or RAID 6 rebuilds of one drive run at
>> approximately the optimal sequential write speed of the replacement
>> drive. It is not unusual on a reasonably configured system.
>
> I freely admit I may have drawn an incorrect conclusion about md parity
> rebuild performance based on incomplete data. I simply don't recall
> anyone stating here in ~3 years that their parity rebuilds were speedy,
> but quite the opposite. I guess it's possible that each one of those
> cases was due to another factor, such as user load, slow CPU, bus
> bottleneck, wonky disk firmware, backplane issues, etc.
>
Maybe this is just reporting bias - people are quick to post about
problems such as slow rebuilds, but very seldom send a message saying
everything worked perfectly!
There /are/ reasons why parity raid rebuilds are going to be slower than
mirror rebuilds - delays on any one disk's reads are one issue, and I expect
that simultaneous use of the array for normal work will have more impact
on parity raid rebuild times than on a mirror array (certainly compared
to raid10 with multiple pairs). I just don't think it is quite as bad
as you think.
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-25 2:14 ` Russell Coker
@ 2013-11-25 9:20 ` David Brown
0 siblings, 0 replies; 104+ messages in thread
From: David Brown @ 2013-11-25 9:20 UTC (permalink / raw)
To: russell, stan; +Cc: Linux RAID Mailing List, Btrfs BTRFS
On 25/11/13 03:14, Russell Coker wrote:
> On Mon, 25 Nov 2013, Stan Hoeppner <stan@hardwarefreak.com> wrote:
>>> If that is the problem then the solution would be to just enable
>>> read-ahead. Don't we already have that in both the OS and the disk
>>> hardware? The hard- drive read-ahead buffer should at least cover the
>>> case where a seek completes but the desired sector isn't under the
>>> heads.
>>
>> I'm not sure if read-ahead would solve such a problem, if indeed this is
>> a possible problem. AFAIK the RAID5/6 drivers process stripes serially,
>> not asynchronously, so I'd think the rebuild may still stall for ms at a
>> time in such a situation.
>
> For a RAID block device (such as Linux software RAID) read-ahead should work
> well. For a RAID type configuration managed by the filesystem where you might
> have different RAID levels in the same filesystem it might not be possible.
>
> It would be a nice feature to have RAID-0 for unimportant files and RAID-1 or
> RAID-6 for important files on the same filesystem. But that type of thing
> would really complicate RAID rebuild.
>
I think btrfs is planning to have such features - different files can
have different raid types. It certainly supports different raid levels
for metadata and file data. But it is definitely a feature you want on
the filesystem level, rather than the raid block device level.
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-22 8:12 ` Russell Coker
@ 2013-11-25 18:23 ` Pasi Kärkkäinen
0 siblings, 0 replies; 104+ messages in thread
From: Pasi Kärkkäinen @ 2013-11-25 18:23 UTC (permalink / raw)
To: Russell Coker; +Cc: linux-btrfs, linux-raid
On Fri, Nov 22, 2013 at 07:12:39PM +1100, Russell Coker wrote:
>
> On Thu, 21 Nov 2013 18:30:49 Stan Hoeppner wrote:
> > I suggest that anyone in the future needing fast random write IOPS is
> > going to move those workloads to SSD, which is steadily increasing in
> > capacity. And I suggest anyone building arrays with 10-20TB drives
> > isn't in need of fast random write IOPS.
>
> Traditionally SCSI/SAS disks have tended to be a lot smaller than IDE/SATA
> disks. Now Dell has just started offering 2TB SAS disks while the largest SATA
> disks that they sell (In Australia on PowerEdge T110 servers at least) are
> also 2TB. Presumably RAID recovery time was one factor that made
> manufacturers not bother with making larger SCSI/SAS disks in the past.
>
Dell has been selling 4TB SAS disks for a long time already... one year, or something like that.
-- Pasi
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-21 20:31 ` David Brown
2013-11-21 20:52 ` Piergiorgio Sartor
@ 2013-11-26 18:10 ` joystick
1 sibling, 0 replies; 104+ messages in thread
From: joystick @ 2013-11-26 18:10 UTC (permalink / raw)
To: David Brown
Cc: Piergiorgio Sartor, Andrea Mazzoleni, linux-raid, linux-btrfs,
hpa, creamyfish
On 21/11/2013 21:31, David Brown wrote:
> On 21/11/13 21:05, Piergiorgio Sartor wrote:
>> Having a multi-parity RAID allows checking
>> even which disk is in error.
>> This would provide the user with more
>> comprehensive information.
>>
>> Of course, since we are there, we can
>> also give the option to fix it.
>> This would be much likely a "fsck".
> If this can all be done to give the user an informed choice, then it
> sounds good.
>
> One issue here is whether the check should be done with the filesystem
> mounted and in use, or only off-line. If it is off-line then it will
> mean a long down-time while the array is checked - but if it is online,
> then there is the risk of confusing the filesystem and caches by
> changing the data.
>
That's a non-existent issue imho, because if that stripe is changing, any
error will be corrected (overwritten), at least on the data disks (the
parity can still be wrong if a shortcut RMW method is used).
So you run fsck on the filesystem and the data comes out good because any
error has already been overwritten, and fsck also reports no errors. Not
useful.
You only have to consider the case where the array check is performed
online and the stripe does not change in the meanwhile, which means it
does not change for a long time, long enough for you to complete all the
checks.
debugfs techniques can tell you which filesystem element corresponds to a
given block number, and this can be done online, with the filesystem
mounted.
I don't understand your point about the caches. Caches are not an obstacle
for the current "check" operation, so they also won't be a problem for the
new, improved check operation you are discussing.
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-20 21:23 ` Andrea Mazzoleni
@ 2013-11-27 2:50 ` ronnie sahlberg
0 siblings, 0 replies; 104+ messages in thread
From: ronnie sahlberg @ 2013-11-27 2:50 UTC (permalink / raw)
To: Andrea Mazzoleni; +Cc: Linux RAID Mailing List, Btrfs BTRFS
This is great stuff.
Now, how can we get this into btrfs and md?
On Wed, Nov 20, 2013 at 1:23 PM, Andrea Mazzoleni <amadvance@gmail.com> wrote:
> Hi,
>
>> First, create a 3 by 6 Cauchy matrix, using x_i = 2^-i, y_i = 0 for i=0, and y_i = 2^i for other i.
>> In this case: x = { 1, 142, 71, 173, 216, 108 }, y = { 0, 2, 4 }. The Cauchy matrix is:
>>
>> 1 2 4 8 16 32
>> 244 83 78 183 118 47
>> 167 39 213 59 153 82
>>
>> Divide row 2 by 244 and row 3 by 167. Then extend it with a row of ones on top and it's still MDS,
>> and that's the code for m=4, with RAID-6 as a subset. Very nice!
>
> You got it Jim!
>
> Thanks,
> Andrea
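For anyone who wants to reproduce Jim's numbers, here is a short Python
sketch (not Andrea's raid.c; gf_mul/gf_inv are ad hoc helpers) that builds
the same 3 by 6 Cauchy matrix over GF(2^8) with the Linux RAID-6 polynomial
0x11d and applies the row scaling described above:

  def gf_mul(a, b, poly=0x11d):
      # Carry-less multiply with reduction modulo the RAID-6 polynomial.
      r = 0
      while b:
          if b & 1:
              r ^= a
          a <<= 1
          if a & 0x100:
              a ^= poly
          b >>= 1
      return r

  def gf_inv(a):
      # Brute force is fine for a one-off check in GF(2^8).
      return next(b for b in range(1, 256) if gf_mul(a, b) == 1)

  inv2 = gf_inv(2)                   # 2^-1 = 142
  x = [1]
  for _ in range(5):
      x.append(gf_mul(x[-1], inv2))  # x = [1, 142, 71, 173, 216, 108]
  y = [0, 2, 4]

  cauchy = [[gf_inv(xj ^ yi) for xj in x] for yi in y]
  print(cauchy[0])                   # [1, 2, 4, 8, 16, 32] -> the RAID-6 row

  # Divide row 2 by 244 and row 3 by 167, then put a row of ones on top,
  # giving a 4-parity matrix whose first two rows are RAID-5/RAID-6.
  matrix = [[1] * 6,
            cauchy[0],
            [gf_mul(e, gf_inv(244)) for e in cauchy[1]],
            [gf_mul(e, gf_inv(167)) for e in cauchy[2]]]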
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-22 13:24 ` David Brown
@ 2013-11-28 7:16 ` Stan Hoeppner
2013-11-28 7:36 ` Russell Coker
` (2 more replies)
0 siblings, 3 replies; 104+ messages in thread
From: Stan Hoeppner @ 2013-11-28 7:16 UTC (permalink / raw)
To: David Brown, James Plank, Ric Wheeler
Cc: Andrea Mazzoleni, H. Peter Anvin, linux-raid, linux-btrfs,
David Smith
Late reply. This one got lost in the flurry of activity...
On 11/22/2013 7:24 AM, David Brown wrote:
> On 22/11/13 09:38, Stan Hoeppner wrote:
>> On 11/21/2013 3:07 AM, David Brown wrote:
>>
>>> For example, with 20 disks at 1 TB each, you can have:
>>
...
>> Maximum:
>>
>> RAID 10 = 10 disk redundancy
>> RAID 15 = 11 disk redundancy
>
> 12 disks maximum (you have 8 with data, the rest are mirrors, parity, or
> mirrors of parity).
>
>> RAID 16 = 12 disk redundancy
>
> 14 disks maximum (you have 6 with data, the rest are mirrors, parity, or
> mirrors of parity).
We must follow different definitions of "redundancy". I view redundancy
as the number of drives that can fail without taking down the array. In
the case of the above 20 drive RAID15 that maximum is clearly 11
drives-- one of every mirror and both of one mirror can fail. The 12th
drive failure kills the array.
>> Range:
>>
>> RAID 10 = 1-10 disk redundancy
>> RAID 15 = 3-11 disk redundancy
>> RAID 16 = 5-12 disk redundancy
>
> Yes, I know these are the minimum redundancies. But that's a vital
> figure for reliability (even if the range is important for statistical
> averages). When one disk in a raid10 array fails, your main concern is
> about failures or URE's in the other half of the pair - it doesn't help
> to know that another nine disks can "safely" fail too.
Knowing this is often critical from an architectural standpoint David.
It is quite common to create the mirrors of a RAID10 across two HBAs and
two JBOD chassis. Some call this "duplexing". With RAID10 you know you
can lose one HBA, one cable, one JBOD (PSU, expander, etc) and not skip
a beat. "RAID15" would work the same in this scenario.
This architecture is impossible with RAID5/6. Any of the mentioned
failures will kill the array.
--
Stan
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-28 7:16 ` Stan Hoeppner
@ 2013-11-28 7:36 ` Russell Coker
2013-11-28 9:56 ` David Brown
2013-11-30 7:32 ` Alex Elsayed
2 siblings, 0 replies; 104+ messages in thread
From: Russell Coker @ 2013-11-28 7:36 UTC (permalink / raw)
To: stan, linux-raid; +Cc: linux-btrfs
On Thu, 28 Nov 2013, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> We must follow different definitions of "redundancy". I view redundancy
> as the number of drives that can fail without taking down the array. In
> the case of the above 20 drive RAID15 that maximum is clearly 11
> drives-- one of every mirror and both of one mirror can fail. The 12th
> drive failure kills the array.
It seems to me that the more useful number is how many disks can randomly die
while there is a guarantee that no data is lost. While with a 20 disk RAID-15
you could get really lucky and have it still work after 11 disks have been
lost, you could lose it all after 4 disks have been lost. Given that drive
failures won't be entirely random (when one disk dies it puts more load on
its mirror), having a 20 disk RAID-15 entirely fail after 4 or 5 disk failures
is probably more likely than having it keep working after 11 failures.
But as far as I can determine in the 20 disk RAID-15 array in question the
"redundancy" would be 11 disks because once you have lost that many disks the
usable capacity would be equal to the number of running disks multiplied by
their capacity.
For simpler RAID configurations the "redundancy" equals the maximum number of
disks that can randomly fail without losing any data, while with nested RAIDs
that's not the case.
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-28 7:16 ` Stan Hoeppner
2013-11-28 7:36 ` Russell Coker
@ 2013-11-28 9:56 ` David Brown
2013-11-30 7:32 ` Alex Elsayed
2 siblings, 0 replies; 104+ messages in thread
From: David Brown @ 2013-11-28 9:56 UTC (permalink / raw)
To: stan, James Plank, Ric Wheeler
Cc: Andrea Mazzoleni, H. Peter Anvin, linux-raid, linux-btrfs,
David Smith
On 28/11/13 08:16, Stan Hoeppner wrote:
> Late reply. This one got lost in the flurry of activity...
>
> On 11/22/2013 7:24 AM, David Brown wrote:
>> On 22/11/13 09:38, Stan Hoeppner wrote:
>>> On 11/21/2013 3:07 AM, David Brown wrote:
>>>
>>>> For example, with 20 disks at 1 TB each, you can have:
>>>
> ...
>>> Maximum:
>>>
>>> RAID 10 = 10 disk redundancy
>>> RAID 15 = 11 disk redundancy
>>
>> 12 disks maximum (you have 8 with data, the rest are mirrors, parity, or
>> mirrors of parity).
>>
>>> RAID 16 = 12 disk redundancy
>>
>> 14 disks maximum (you have 6 with data, the rest are mirrors, parity, or
>> mirrors of parity).
>
> We must follow different definitions of "redundancy". I view redundancy
> as the number of drives that can fail without taking down the array. In
> the case of the above 20 drive RAID15 that maximum is clearly 11
> drives-- one of every mirror and both of one mirror can fail. The 12th
> drive failure kills the array.
>
No, we have the same definitions of redundancy - just different
definitions of basic arithmetic. Your definition is a bit more common!
My error was actually in an earlier email, when I listed the usable
capacities of different layouts for 20 x 1TB drives. I wrote:
> raid10 = 10TB, 1 disk redundancy
> raid15 = 8TB, 3 disk redundancy
> raid16 = 6TB, 5 disk redundancy
Of course, it should be:
raid10 = 10TB, 1 disk redundancy
raid15 = 9TB, 3 disk redundancy
raid16 = 8TB, 5 disk redundancy
So it is your fault for not spotting my earlier mistake :-)
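For reference, a small Python sketch (not from the thread) that reproduces
these usable-capacity figures, together with the worst-case and best-case
redundancy ranges quoted earlier, for 20x 1TB disks in 2-way mirror pairs:

  def nested(n_disks, tb_per_disk, parity):
      # parity: 0 = raid10, 1 = raid15, 2 = raid16
      pairs = n_disks // 2
      usable = (pairs - parity) * tb_per_disk
      worst = 2 * parity + 1    # any this many failures are always survivable
      best = pairs + parity     # survivable only if the failures fall kindly
      return usable, worst, best

  for name, p in (("raid10", 0), ("raid15", 1), ("raid16", 2)):
      usable, worst, best = nested(20, 1, p)
      print(name, usable, worst, best)
  # raid10: 10TB usable, 1-10 disk redundancy
  # raid15:  9TB usable, 3-11 disk redundancy
  # raid16:  8TB usable, 5-12 disk redundancy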
>>> Range:
>>>
>>> RAID 10 = 1-10 disk redundancy
>>> RAID 15 = 3-11 disk redundancy
>>> RAID 16 = 5-12 disk redundancy
>>
>> Yes, I know these are the minimum redundancies. But that's a vital
>> figure for reliability (even if the range is important for statistical
>> averages). When one disk in a raid10 array fails, your main concern is
>> about failures or URE's in the other half of the pair - it doesn't help
>> to know that another nine disks can "safely" fail too.
>
> Knowing this is often critical from an architectural standpoint David.
> It is quite common to create the mirrors of a RAID10 across two HBAs and
> two JBOD chassis. Some call this "duplexing". With RAID10 you know you
> can lose one HBA, one cable, one JBOD (PSU, expander, etc) and not skip
> a beat. "RAID15" would work the same in this scenario.
>
That is absolutely true, and I agree that it is very important when
setting up big arrays. You have to make decisions like where you split
your raid1 pairs - putting them on different controllers/chassis means
you can survive the loss of a whole half of the system. On the other
hand, putting them on the same controller could mean hardware raid1 is
more efficient and you don't need to duplicate the traffic over the
higher level interfaces.
But here we are looking at one specific class of failures - hard disk
failures (including complete disk failure and URE's). For that, the
redundancy is the number of disks that can fail without data loss,
assuming the worst possible combination of failures. And given the
extra stress on the disks during degraded access or rebuilds, "bad"
combinations are more likely than "good" combinations.
So I think it is of little help to say that a 20 disk raid 15 can
survive up to 11 disk failures. It is far more interesting to say that
it can survive any 3 random disk failures, and (if connected as you
describe with two controllers and chassis) it can also survive the
complete failure of a chassis or controller while still retaining a one
disk redundancy.
As a side issue here, I wonder if a write intent bitmap can be used for
a chassis failure so that when the chassis is fixed (the controller card
replaced, the cable re-connected, etc.) the disks inside can be brought
up to sync again without a full rebuild.
> This architecture is impossible with RAID5/6. Any of the mentioned
> failures will kill the array.
>
Yes.
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-28 7:16 ` Stan Hoeppner
2013-11-28 7:36 ` Russell Coker
2013-11-28 9:56 ` David Brown
@ 2013-11-30 7:32 ` Alex Elsayed
2013-12-01 15:37 ` Stan Hoeppner
2 siblings, 1 reply; 104+ messages in thread
From: Alex Elsayed @ 2013-11-30 7:32 UTC (permalink / raw)
To: linux-raid; +Cc: linux-btrfs
Stan Hoeppner wrote:
> Late reply. This one got lost in the flurry of activity...
>
<snip>
>
> We must follow different definitions of "redundancy". I view redundancy
> as the number of drives that can fail without taking down the array. In
> the case of the above 20 drive RAID15 that maximum is clearly 11
> drives-- one of every mirror and both of one mirror can fail. The 12th
> drive failure kills the array.
IMHO, 'redundancy' is not the important thing here, and may conflate two
things - 'how much storage is spent on things other than my data (mirrors,
parity)' [storage efficiency] and 'if the universe is trying to screw me
over, how many disk losses can I survive?' [failure resilience]
Your 11 disks number is best-case, but quicksort has taught me to always
look at average-case and worst-case as well. What you describe has very good
best-case failure resilience, but its worst-case resilience is poorer. It
has better best-case, average-case, *and* worst-case performance, but has
considerably worse storage efficiency.
All of those need to be weighed in deciding which to use; raid 15 being
'just better' isn't something that can be claimed. It depends on the
workload.
<snip>
> Knowing this is often critical from an architectural standpoint David.
> It is quite common to create the mirrors of a RAID10 across two HBAs and
> two JBOD chassis. Some call this "duplexing". With RAID10 you know you
> can lose one HBA, one cable, one JBOD (PSU, expander, etc) and not skip
> a beat. "RAID15" would work the same in this scenario.
>
> This architecture is impossible with RAID5/6. Any of the mentioned
> failures will kill the array.
And again, these address different issues. For one, there's multipath - with
dual-ported drives, multipath, etc. you can tolerate HBA failure; that
renders part of the issue something of a canard.
Adding a second JBOD is really not an apples-to-apples comparison, either -
it's not likely to be a useful configuration in the same situations that lend
themselves to parity >> 2. Certainly, no-one is advocating that support for
RAID 10 be removed, and mdadm is certainly capable of handling a manually-
created RAID 15 today.
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-11-30 7:32 ` Alex Elsayed
@ 2013-12-01 15:37 ` Stan Hoeppner
0 siblings, 0 replies; 104+ messages in thread
From: Stan Hoeppner @ 2013-12-01 15:37 UTC (permalink / raw)
To: Alex Elsayed, linux-raid
On 11/30/2013 1:32 AM, Alex Elsayed wrote:
> Stan Hoeppner wrote:
>
>> Late reply. This one got lost in the flurry of activity...
>>
> <snip>
>>
>> We must follow different definitions of "redundancy". I view redundancy
>> as the number of drives that can fail without taking down the array. In
>> the case of the above 20 drive RAID15 that maximum is clearly 11
>> drives-- one of every mirror and both of one mirror can fail. The 12th
>> drive failure kills the array.
>
> IMHO, 'redundancy' is not the important thing here, and may conflate two
> things - 'how much storage is spent on things other than my data (mirrors,
> parity)' [storage efficiency] and 'if the universe is trying to screw me
> over, how many disk losses can I survive?' [failure resilience]
>
> Your 11 disks number is best-case,
This is not -my- number. It is simply the best case RAID15 number.
Just as the worst case is 3. Both numbers are part of the decision
making equation. In my response to David, in which you cut the context,
I was simply stating his numbers were different than mine, and likely
wrong, which they were. I was not making the argument, as you seem to
suggest here, that one should look at the best case 11 disk redundancy
as a magical number on which to make a decision.
> but quicksort has taught me to always
> look at average-case and worst-case as well. What you describe has very good
> best-case failure resilience, but its worst-case resilience is poorer. It
> has better best-case, average-case, *and* worst-case performance, but has
> considerably worse storage efficiency.
This has already been stated many times in the thread, more than once by
me.
> All of those need to be weighed in deciding which to use; raid 15 being
> 'just better' isn't something that can be claimed. It depends on the
> workload.
Again, already stated multiple times.
> <snip>
>> Knowing this is often critical from an architectural standpoint David.
>> It is quite common to create the mirrors of a RAID10 across two HBAs and
>> two JBOD chassis. Some call this "duplexing". With RAID10 you know you
>> can lose one HBA, one cable, one JBOD (PSU, expander, etc) and not skip
>> a beat. "RAID15" would work the same in this scenario.
>>
>> This architecture is impossible with RAID5/6. Any of the mentioned
>> failures will kill the array.
>
> And again, these address different issues.
"And again"? This is the first time this has been discussed in this
thread. And this isn't a "different issue". It is part of the storage
architecture equation, and for some, critically important.
> For one, there's multipath - with
> dual-ported drives, multipath, etc. you can tolerate HBA failure; that
> renders part of the issue something of a canard.
Except for one critical factor: Only a select few SAS drive models offer
dual ports, and they are very pricey, typically only offered in "Big 5"
vendor integrated storage products, usually SAN arrays, which include
RAID capability. You don't often find dual-ported SAS drives available
in anyone's JBOD chassis. ZERO SATA drives are dual ported. The vast
majority of people using md/RAID are using SATA disks, not SAS.
To make such an argument displays some serious ignorance. To call my
statement of fact a canard displays even more...
> Adding a second JBOD is really not an apples-to-apples comparison, either -
> it's not likely to be a useful configuration in the same situations as lend
> themselves to parity
Introducing a straw man? Either that's intentional or you didn't read
my reply to David in context. Please go back and read the exchange
again. And for that matter, re-read the entire thread. I've clearly
stated once, if not twice, that RAID15 is something that only current
RAID10 users would be interested in, and that current parity RAID users
would shun it. Just as you are here.
> Certainly, no-one is advocating that support for
> RAID 10 be removed, and
Where are you coming up with such nonsense? Of course no one has
suggested this. Why did you make this statement?
> mdadm is certainly capable of handling a manually-
> created RAID 15 today.
No, it is certainly not capable today. Somehow you have missed all of
the posts explaining and discussing the key feature requirements of the
proposed RAID15 personality.
Considering the quantity of factual errors you've made in this post,
Alex, I suggest you get your ducks in a row before your next reply, or
simply stay out of this thread entirely. To date you've posted 3
replies, and not a one has added anything of value to the discussion.
I should have started a separate thread for the RAID15 idea, instead of
dropping it into the triple parity thread. That would have kept the
BTRFS users out of the discussion, as there would have been no reason to
CC that list, as was done in this case...
Speaking of which, no need to reply-all to BTRFS as this discussion has
nothing to do with it, so I have removed it from the CC.
--
Stan
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
@ 2013-12-01 17:53 Richard Scobie
2013-12-02 4:30 ` Stan Hoeppner
0 siblings, 1 reply; 104+ messages in thread
From: Richard Scobie @ 2013-12-01 17:53 UTC (permalink / raw)
To: Linux RAID Mailing List
Stan Hoeppner wrote:
> ZERO SATA drives are dual ported. The vast majority of people using
> md/RAID are using SATA disks, not SAS.
While admittedly uncommon, they are available.
http://h30094.www3.hp.com/product/sku/10435415
Regards,
Richard
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Triple parity and beyond
2013-12-01 17:53 Triple parity and beyond Richard Scobie
@ 2013-12-02 4:30 ` Stan Hoeppner
0 siblings, 0 replies; 104+ messages in thread
From: Stan Hoeppner @ 2013-12-02 4:30 UTC (permalink / raw)
To: Richard Scobie, Linux RAID Mailing List
On 12/1/2013 11:53 AM, Richard Scobie wrote:
> Stan Hoeppner wrote:
>
>> ZERO SATA drives are dual ported. The vast majority of people using
>> md/RAID are using SATA disks, not SAS.
>
> While admittedly uncommon, they are available.
>
> http://h30094.www3.hp.com/product/sku/10435415
Assuming that's not a misprint, then this is not a dual ported SATA
drive. The SATA protocol doesn't support this. This is a SATA drive in
HP's vendor proprietary carrier with a proprietary factory integrated
external SAS interposer card hanging off the back. This interposer
allows you to add this SATA drive to an HP dual ported SAS enclosure
backplane and HBAs. They also offer an interposer to drop SATA drives
into FC enclosures.
Note these SAS-SATA interposers require SAS HBAs and an SAS enclosure.
Given the small price difference between SATA drives w/interposer, and
native nearline SAS drives, and the greater reliability/warranty of the
latter, I don't think you'll see many folks buying these interposer
drives. Especially given the required investment in SAS infrastructure
between the drives and host.
See:
http://serialstoragewire.net/Articles/2007_07/developer24.html
http://en.wikipedia.org/wiki/Interposer
--
Stan
^ permalink raw reply [flat|nested] 104+ messages in thread
end of thread
Thread overview: 104+ messages
2013-12-01 17:53 Triple parity and beyond Richard Scobie
2013-12-02 4:30 ` Stan Hoeppner
-- strict thread matches above, loose matches on Subject: below --
2013-11-18 22:08 Andrea Mazzoleni
2013-11-18 22:12 ` H. Peter Anvin
2013-11-18 22:35 ` Andrea Mazzoleni
2013-11-18 23:25 ` H. Peter Anvin
2013-11-19 10:16 ` David Brown
2013-11-19 17:36 ` Andrea Mazzoleni
2013-11-19 22:51 ` Drew
2013-11-20 0:54 ` Chris Murphy
2013-11-20 1:23 ` John Williams
2013-11-20 10:35 ` David Brown
2013-11-20 10:31 ` David Brown
2013-11-20 18:09 ` John Williams
2013-11-20 18:44 ` Andrea Mazzoleni
2013-11-21 6:15 ` Stan Hoeppner
2013-11-21 8:32 ` David Brown
2013-11-20 18:34 ` Andrea Mazzoleni
2013-11-20 18:43 ` H. Peter Anvin
2013-11-20 18:56 ` Andrea Mazzoleni
2013-11-20 18:59 ` H. Peter Anvin
2013-11-20 21:21 ` Andrea Mazzoleni
2013-11-20 19:00 ` H. Peter Anvin
2013-11-20 21:04 ` Andrea Mazzoleni
2013-11-20 21:06 ` H. Peter Anvin
2013-11-21 8:36 ` David Brown
2013-11-19 17:28 ` Andrea Mazzoleni
2013-11-19 20:29 ` Ric Wheeler
2013-11-20 16:16 ` James Plank
2013-11-20 19:05 ` Andrea Mazzoleni
2013-11-20 19:10 ` H. Peter Anvin
2013-11-20 20:30 ` James Plank
2013-11-20 21:23 ` Andrea Mazzoleni
2013-11-27 2:50 ` ronnie sahlberg
2013-11-20 21:28 ` H. Peter Anvin
2013-11-21 1:28 ` Stan Hoeppner
2013-11-21 2:46 ` John Williams
2013-11-21 6:52 ` Stan Hoeppner
2013-11-21 7:05 ` John Williams
2013-11-21 22:57 ` Stan Hoeppner
2013-11-21 23:38 ` John Williams
2013-11-22 9:35 ` Stan Hoeppner
2013-11-22 11:24 ` joystick
2013-11-22 15:01 ` John Williams
2013-11-22 22:28 ` Stan Hoeppner
2013-11-22 23:07 ` NeilBrown
2013-11-23 3:46 ` Stan Hoeppner
2013-11-23 5:04 ` NeilBrown
2013-11-23 5:34 ` John Williams
2013-11-23 7:12 ` NeilBrown
2013-11-24 4:03 ` Stan Hoeppner
2013-11-24 5:14 ` John Williams
2013-11-24 21:13 ` Stan Hoeppner
2013-11-24 23:28 ` Rudy Zijlstra
2013-11-24 23:53 ` Alex Elsayed
2013-11-25 2:04 ` Stan Hoeppner
2013-11-25 4:48 ` Alex Elsayed
2013-11-25 9:15 ` David Brown
2013-11-24 5:19 ` Russell Coker
2013-11-24 21:44 ` Stan Hoeppner
2013-11-24 22:31 ` Mark Knecht
2013-11-25 2:14 ` Russell Coker
2013-11-25 9:20 ` David Brown
2013-11-21 8:08 ` joystick
2013-11-22 0:30 ` Stan Hoeppner
2013-11-22 0:33 ` H. Peter Anvin
2013-11-22 0:45 ` David Brown
2013-11-21 9:07 ` David Brown
2013-11-21 9:54 ` Adam Goryachev
2013-11-21 10:32 ` David Brown
2013-11-22 8:12 ` Russell Coker
2013-11-25 18:23 ` Pasi Kärkkäinen
2013-11-22 8:13 ` Stan Hoeppner
2013-11-22 13:15 ` David Brown
2013-11-22 16:07 ` Stan Hoeppner
2013-11-22 22:59 ` NeilBrown
2013-11-23 17:39 ` David Brown
2013-11-22 16:50 ` Mark Knecht
2013-11-22 19:51 ` Duncan
2013-11-22 8:38 ` Stan Hoeppner
2013-11-22 13:24 ` David Brown
2013-11-28 7:16 ` Stan Hoeppner
2013-11-28 7:36 ` Russell Coker
2013-11-28 9:56 ` David Brown
2013-11-30 7:32 ` Alex Elsayed
2013-12-01 15:37 ` Stan Hoeppner
2013-11-22 14:19 ` David Taylor
2013-11-21 19:56 ` Piergiorgio Sartor
2013-11-19 18:12 ` Piergiorgio Sartor
2013-11-20 10:44 ` David Brown
2013-11-20 21:59 ` Piergiorgio Sartor
2013-11-21 10:13 ` David Brown
2013-11-21 17:37 ` Goffredo Baroncelli
2013-11-21 20:05 ` Piergiorgio Sartor
2013-11-21 20:31 ` David Brown
2013-11-21 20:52 ` Piergiorgio Sartor
2013-11-22 0:32 ` David Brown
2013-11-22 20:32 ` Piergiorgio Sartor
2013-11-26 18:10 ` joystick
2013-11-20 21:38 ` Andrea Mazzoleni
2013-11-20 22:29 ` Piergiorgio Sartor
2013-11-23 7:55 ` Andrea Mazzoleni
2013-11-23 22:10 ` Piergiorgio Sartor
2013-11-24 9:39 ` Andrea Mazzoleni