* [LSF/MM TOPIC] De-clustered RAID with MD
From: Johannes Thumshirn @ 2018-01-29 15:23 UTC
To: lsf-pc
Cc: linux-raid, linux-block, Hannes Reinecke, Neil Brown

Hi linux-raid, lsf-pc

(If you've received this mail multiple times, I'm sorry, I'm having
trouble with the mail setup.)

With the rise of bigger and bigger disks, array rebuilding times start
skyrocketing.

In a paper from '92, Holland and Gibson [1] suggest a mapping algorithm
similar to RAID5, but instead of utilizing all disks in an array for
every I/O operation, they implement a per-I/O mapping function that uses
only a subset of the available disks.

This has at least two advantages:
1) If one disk has to be replaced, the data does not need to be read from
   all disks to recover the failed disk, so non-affected disks can be
   used for real user I/O and not just recovery, and
2) an efficient mapping function can improve parallel I/O submission, as
   two different I/Os are not necessarily going to the same disks in the
   array.

For the mapping function, a hashing algorithm like Ceph's CRUSH [2]
would be ideal, as it provides a pseudo-random but deterministic mapping
of the I/O onto the drives.

This whole declustering of course only makes sense for more than (at
least) 4 drives, but we do have customers with several orders of
magnitude more drives in an MD array.

At LSF I'd like to discuss whether:
1) the wider MD audience is interested in de-clustered RAID with MD,
2) de-clustered RAID should be implemented as a sublevel of RAID5 or
   as a new personality, and
3) CRUSH is a suitable algorithm for this (there's evidence in [3] that
   the NetApp E-Series arrays do use CRUSH for parity declustering).

[1] http://www.pdl.cmu.edu/PDL-FTP/Declustering/ASPLOS.pdf
[2] https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf
[3] https://www.snia.org/sites/default/files/files2/files2/SDC2013/presentations/DistributedStorage/Jibbe-Gwaltney_Method-to_Establish_High_Availability.pdf

Thanks,
	Johannes
--
Johannes Thumshirn                                            Storage
jthumshirn@suse.de                                  +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
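
A minimal sketch of the kind of mapping the proposal describes: a
pseudo-random but deterministic function that places each logical stripe
on a small subset of a large pool of drives. This is not CRUSH and not
MD code; the mixer, the 100-drive pool and the 6+2 chunk count are all
made-up parameters for illustration only.

#include <stdio.h>
#include <stdint.h>

/* cheap 64-bit mixer (splitmix64 finalizer), stands in for a real hash */
static uint64_t mix(uint64_t x)
{
	x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
	x ^= x >> 27; x *= 0x94d049bb133111ebULL;
	return x ^ (x >> 31);
}

/* Fill map[0..k-1] with k distinct drives out of n for logical stripe s. */
static void map_stripe(uint64_t s, int n, int k, int *map)
{
	for (int slot = 0; slot < k; slot++) {
		uint64_t r = 0;
		int d, dup;

		do {	/* re-hash until we find a drive not already used */
			d = mix(s * 1000003ULL + slot * 7919ULL + r++) % n;
			dup = 0;
			for (int i = 0; i < slot; i++)
				if (map[i] == d)
					dup = 1;
		} while (dup);
		map[slot] = d;
	}
}

int main(void)
{
	int map[8];	/* 6 data + 2 parity chunks per logical stripe */

	for (uint64_t s = 0; s < 4; s++) {
		map_stripe(s, 100, 8, map);
		printf("stripe %llu ->", (unsigned long long)s);
		for (int i = 0; i < 8; i++)
			printf(" %d", map[i]);
		printf("\n");
	}
	return 0;
}

The same stripe number always yields the same member set (so reads need
no lookup table), while different stripes land on different subsets.
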
* Re: [LSF/MM TOPIC] De-clustered RAID with MD
From: Wols Lists @ 2018-01-29 16:32 UTC
To: Johannes Thumshirn, lsf-pc
Cc: linux-raid, linux-block, Hannes Reinecke, Neil Brown

On 29/01/18 15:23, Johannes Thumshirn wrote:
> Hi linux-raid, lsf-pc
>
> (If you've received this mail multiple times, I'm sorry, I'm having
> trouble with the mail setup.)

My immediate reactions as a lay person (I edit the raid wiki) ...

> With the rise of bigger and bigger disks, array rebuilding times start
> skyrocketing.

And? Yes, your data is at risk during a rebuild, but md-raid throttles
the i/o, so it doesn't hammer the system.

> In a paper from '92, Holland and Gibson [1] suggest a mapping algorithm
> similar to RAID5, but instead of utilizing all disks in an array for
> every I/O operation, they implement a per-I/O mapping function that uses
> only a subset of the available disks.
>
> This has at least two advantages:
> 1) If one disk has to be replaced, the data does not need to be read from
>    all disks to recover the failed disk, so non-affected disks can be
>    used for real user I/O and not just recovery, and

Again, that's throttling, so that's not a problem ...

> 2) an efficient mapping function can improve parallel I/O submission, as
>    two different I/Os are not necessarily going to the same disks in the
>    array.
>
> For the mapping function, a hashing algorithm like Ceph's CRUSH [2]
> would be ideal, as it provides a pseudo-random but deterministic mapping
> of the I/O onto the drives.
>
> This whole declustering of course only makes sense for more than (at
> least) 4 drives, but we do have customers with several orders of
> magnitude more drives in an MD array.

If you have four drives or more - especially if they are multi-terabyte
drives - you should NOT be using raid-5 ...

> At LSF I'd like to discuss whether:
> 1) the wider MD audience is interested in de-clustered RAID with MD

I haven't read the papers, so no comment, sorry.

> 2) de-clustered RAID should be implemented as a sublevel of RAID5 or
>    as a new personality

Neither! If you're going to do it, it should be raid-6.

> 3) CRUSH is a suitable algorithm for this (there's evidence in [3] that
>    the NetApp E-Series arrays do use CRUSH for parity declustering)
>
> [1] http://www.pdl.cmu.edu/PDL-FTP/Declustering/ASPLOS.pdf
> [2] https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf
> [3] https://www.snia.org/sites/default/files/files2/files2/SDC2013/presentations/DistributedStorage/Jibbe-Gwaltney_Method-to_Establish_High_Availability.pdf

Okay - I've now skimmed the CRUSH paper [2]. Looks well interesting.
BUT. It feels more like btrfs than it does like raid.

Btrfs manages disks, and does raid; it tries to be the "everything
between the hard drive and the file". This CRUSH thingy reads to me like
it wants to be the same. There's nothing wrong with that, but md is a
unix-y "do one thing (raid) and do it well".

My knee-jerk reaction is if you want to go for it, it sounds like a good
idea. It just doesn't really feel a good fit for md.

Cheers,
Wol
* Re: [LSF/MM TOPIC] De-clustered RAID with MD
From: NeilBrown @ 2018-01-29 21:50 UTC
To: Wols Lists, Johannes Thumshirn, lsf-pc
Cc: linux-raid, linux-block, Hannes Reinecke, Neil Brown

On Mon, Jan 29 2018, Wols Lists wrote:

> On 29/01/18 15:23, Johannes Thumshirn wrote:
>> With the rise of bigger and bigger disks, array rebuilding times start
>> skyrocketing.
>
> And? Yes, your data is at risk during a rebuild, but md-raid throttles
> the i/o, so it doesn't hammer the system.
>
>> This has at least two advantages:
>> 1) If one disk has to be replaced, the data does not need to be read from
>>    all disks to recover the failed disk, so non-affected disks can be
>>    used for real user I/O and not just recovery, and
>
> Again, that's throttling, so that's not a problem ...

Imagine an array with 100 drives on which we store data in sets of
(say) 6 data chunks and 2 parity chunks.
Each group of 8 chunks is distributed over the 100 drives in a
different way, so that (e.g.) 600 data chunks and 200 parity chunks are
distributed over 8 physical stripes using some clever distribution
function.
If (when) one drive fails, the 8 chunks in this set of 8 physical
stripes can be recovered by reading 6*8 == 48 chunks, which will each be
on a different drive. Half the drives deliver only one chunk (in an
ideal distribution) and the other half deliver none. Maybe they will
deliver some for the next set of 100 logical stripes.

You would probably say that even doing raid6 on 100 drives is crazy.
Better to make, e.g., 10 groups of 10 and do raid6 on each of the 10,
then LVM them together.

By doing declustered parity you can sanely do raid6 on 100 drives, using
a logical stripe size that is much smaller than 100.
When recovering a single drive, the 10-groups-of-10 would put heavy load
on 9 other drives, while the decluster approach puts light load on 99
other drives. No matter how clever md is at throttling recovery, I
would still rather distribute the load so that md has an easier job.

NeilBrown

[... remainder of quoted message trimmed ...]
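
A quick numerical sanity check of the fan-out argument above, under the
same assumed numbers (100 drives, 6 data + 2 parity chunks per logical
stripe) and a purely random placement; this is only a sketch, not how MD
would actually lay out or rebuild chunks.

#include <stdio.h>
#include <stdlib.h>

#define DRIVES  100
#define CHUNKS  8		/* 6 data + 2 parity */
#define NEEDED  6		/* chunks read to rebuild one missing chunk */
#define STRIPES 100000

int main(void)
{
	static long reads[DRIVES];
	int set[CHUNKS];

	srand(1);
	for (long s = 0; s < STRIPES; s++) {
		/* pick CHUNKS distinct drives for this logical stripe */
		for (int i = 0; i < CHUNKS; i++) {
			int d, dup;
			do {
				d = rand() % DRIVES;
				dup = 0;
				for (int j = 0; j < i; j++)
					if (set[j] == d)
						dup = 1;
			} while (dup);
			set[i] = d;
		}
		/* if drive 0 is a member, read any 6 surviving chunks */
		int hit = 0;
		for (int i = 0; i < CHUNKS; i++)
			if (set[i] == 0)
				hit = 1;
		if (hit) {
			int done = 0;
			for (int i = 0; i < CHUNKS && done < NEEDED; i++)
				if (set[i] != 0) {
					reads[set[i]]++;
					done++;
				}
		}
	}

	long tot = 0, max = 0;
	for (int d = 1; d < DRIVES; d++) {
		tot += reads[d];
		if (reads[d] > max)
			max = reads[d];
	}
	printf("recovery reads: total %ld, avg/drive %.1f, max/drive %ld\n",
	       tot, (double)tot / (DRIVES - 1), max);
	return 0;
}

With a reasonably even layout the per-drive maximum stays close to the
average over 99 survivors, instead of the whole rebuild landing on the 9
remaining members of one raid6 group.
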
* Re: [LSF/MM TOPIC] De-clustered RAID with MD
From: Wols Lists @ 2018-01-30 10:43 UTC
To: NeilBrown, Johannes Thumshirn, lsf-pc
Cc: linux-raid, linux-block, Hannes Reinecke, Neil Brown

On 29/01/18 21:50, NeilBrown wrote:
> By doing declustered parity you can sanely do raid6 on 100 drives, using
> a logical stripe size that is much smaller than 100.
> When recovering a single drive, the 10-groups-of-10 would put heavy load
> on 9 other drives, while the decluster approach puts light load on 99
> other drives. No matter how clever md is at throttling recovery, I
> would still rather distribute the load so that md has an easier job.

Not offering to do it ... :-)

But that sounds a bit like linux raid-10. Could a simple approach be to
do something like "raid-6,11,100", ie raid-6 with 9 data chunks, two
parity, striped across 100 drives? Okay, it's not as good as the
decluster approach, but it would spread the stress of a rebuild across
20 drives, not 10. And probably be fairly easy to implement.

Cheers,
Wol
* Re: [LSF/MM TOPIC] De-clustered RAID with MD
From: NeilBrown @ 2018-01-30 11:24 UTC
To: Wols Lists, Johannes Thumshirn, lsf-pc
Cc: linux-raid, linux-block, Hannes Reinecke, Neil Brown

On Tue, Jan 30 2018, Wols Lists wrote:

> On 29/01/18 21:50, NeilBrown wrote:
>> By doing declustered parity you can sanely do raid6 on 100 drives, using
>> a logical stripe size that is much smaller than 100.
>> When recovering a single drive, the 10-groups-of-10 would put heavy load
>> on 9 other drives, while the decluster approach puts light load on 99
>> other drives. No matter how clever md is at throttling recovery, I
>> would still rather distribute the load so that md has an easier job.
>
> Not offering to do it ... :-)
>
> But that sounds a bit like linux raid-10. Could a simple approach be to
> do something like "raid-6,11,100", ie raid-6 with 9 data chunks, two
> parity, striped across 100 drives? Okay, it's not as good as the
> decluster approach, but it would spread the stress of a rebuild across
> 20 drives, not 10. And probably be fairly easy to implement.

If you did that, I think you would be about 80% of the way to fully
declustered-parity RAID.
If you then tweak the math a bit so that one stripe was

  A1 A2 A3 A4 B1 B2 B3 B4 C1 C2 C3 C4 ....

and the next

  A1 C1 A2 C2 A3 C3 A4 C4 B1 D1 B2 D2 ....

and then

  A1 B1 C1 D1 A2 B2 C2 D2 A3 B3 C3 D3 ....
        XX

when Ax are a logical stripe and Bx are the next, then you have a
slightly better distribution. If device XX fails then the reads needed
for the first stripe mostly come from different drives than those for
the second stripe, which are mostly different again for the 3rd stripe.

Presumably the CRUSH algorithm (which I only skim-read once about a year
ago) formalizes how to do this, and does it better.
Once you have the data handling in place for your proposal, it should be
little more than replacing a couple of calculations to get the full
solution.

NeilBrown
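
A small sketch of "replacing a couple of calculations": these are not
Neil's exact layouts, just a per-row stride permutation (each stride
co-prime with the row width, both values chosen arbitrarily here) that
changes which drive holds which logical chunk from one physical stripe
row to the next.

#include <stdio.h>

#define WIDTH 12		/* physical slots (drives) per row */

int main(void)
{
	int strides[] = { 1, 5, 7 };	/* each co-prime with WIDTH */

	for (int r = 0; r < 3; r++) {
		for (int p = 0; p < WIDTH; p++) {
			int logical = (p * strides[r]) % WIDTH;
			/* logical chunks 0..3 are stripe A, 4..7 B, 8..11 C */
			printf("%c%d ", 'A' + logical / 4, logical % 4 + 1);
		}
		printf("\n");
	}
	return 0;
}

Because the strides differ per row, a failed drive's surviving stripe
members sit on mostly different drives in each row, which is the load
spreading being discussed.
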
* Re: [LSF/MM TOPIC] De-clustered RAID with MD
From: Wol's lists @ 2018-01-30 17:40 UTC
To: NeilBrown, Johannes Thumshirn, lsf-pc
Cc: linux-raid, linux-block, Hannes Reinecke, Neil Brown

On 30/01/18 11:24, NeilBrown wrote:
[...]
> Presumably the CRUSH algorithm (which I only skim-read once about a year
> ago) formalizes how to do this, and does it better.
> Once you have the data handling in place for your proposal, it should be
> little more than replacing a couple of calculations to get the full
> solution.

Okay. I think I have it - a definition for raid-16 (or is it raid-61).
But I need a bit of help with the maths. And it might need a look-up
table :-(

Okay. Like raid-10, raid-16 would be spec'd as "--level 16,3,8", ie 3
mirrors, emulating an 8-drive raid-6.

What I'm looking for is a PRNG that has the "bug" that it repeats over a
short period, and over that period is guaranteed to repeat each number
the same number of times. I saw a wonderful video demonstration of this
years ago - if you plot the generated number against the number of times
it was generated, after a few rows it "filled up" a rectangle on the
graph.

At which point the maths becomes very simple. I just need at least as
many real drives as "mirrors times emulated".

If somebody can come up with the algorithm I want, I can spec it, and
then maybe someone can implement it? It'll be fun testing - I'll need my
new machine when I get it working :-)

Cheers,
Wol
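
One candidate for the "balanced PRNG" asked for here is simply an affine
map i -> (a*i + b) mod N: if gcd(a, N) == 1 it visits every value exactly
once per period of N, so over any whole number of periods every output
occurs equally often. The N, a and b below are arbitrary example values,
not a proposed parameterisation.

#include <stdio.h>

#define N 24			/* e.g. mirrors * emulated drives */

int main(void)
{
	const int a = 7, b = 3;	/* a must be co-prime with N */
	int count[N] = { 0 };

	for (int i = 0; i < 5 * N; i++)		/* five full periods */
		count[(a * i + b) % N]++;

	for (int v = 0; v < N; v++)
		printf("value %2d generated %d times\n", v, count[v]);
	return 0;
}

Plotting value against hit count gives exactly the "filled-up rectangle"
described above, and no look-up table is needed.
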
* Re: [LSF/MM TOPIC] De-clustered RAID with MD
From: Wols Lists @ 2018-02-03 15:53 UTC
To: NeilBrown, Johannes Thumshirn, lsf-pc
Cc: linux-raid, linux-block, Hannes Reinecke, Neil Brown

On 30/01/18 11:24, NeilBrown wrote:
> When Ax are a logical stripe and Bx are the next, then you have a
> slightly better distribution. If device XX fails then the reads needed
> for the first stripe mostly come from different drives than those for
> the second stripe, which are mostly different again for the 3rd stripe.

Well, my maths has been well and truly proved wrong. But. Does streaming
from one drive directly onto another have value in reducing stress?

I've just done a worked example for four drives, mirrored three times
across eighteen drives. The result is that the drives have ended up in
six groups of three mirrored drives. So if any drive fails, I can just
copy an almost-identical drive from the mirror. I'll have to play with
eg 17 drives and see if this alters the dynamics - it'll be interesting.

 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
a1 b2 c3 a4 c1 a2 b3 c4 b1 c2 a3 b4 a1 b2 c3 a4 c1 a2
b3 c4 b1 c2 a3 b4 a1 b2 c3 a4 c1 a2 b3 c4 b1 c2 a3 b4

Note I'm rotating 5 drives at a time, so a1,a2,a3,a4 goes on drives
0,5,10,15 ... But pick any drive and you'll notice the drives six away
are almost mirrors.

Cheers,
Wol
* Re: [LSF/MM TOPIC] De-clustered RAID with MD
From: Wols Lists @ 2018-02-03 17:16 UTC
To: NeilBrown, Johannes Thumshirn, lsf-pc
Cc: linux-raid, linux-block, Hannes Reinecke, Neil Brown

On 30/01/18 11:24, NeilBrown wrote:
> When Ax are a logical stripe and Bx are the next, then you have a
> slightly better distribution. If device XX fails then the reads needed
> for the first stripe mostly come from different drives than those for
> the second stripe, which are mostly different again for the 3rd stripe.

Having got myself thoroughly confused, I think I've got my head
straight. There's actually an incredibly simple algorithm for this, I
think.

First of all, let's state that we're defining a new complex raid here;
people get confused between linux-raid-10 and raid-1+0, so ... I was
originally thinking about raid-61, mirroring raid-6 across a random
number of drives. I think here we're actually talking about raid-60 -
doing a raid-6 across a random number of drives. Can we prefix whatever
raid level we call it with an "L" so it's not confused with standard
raid-6+0?

Okay. Let's say we have a five-disk raid-6 smeared across seven drives.
Let's call it Raid-L60. That gives us 35 blocks per disk stripe.

A1 A2 A3 A4 A5
B1 B2 B3 B4 B5
C1 C2 C3 C4 C5
D1 D2 D3 D4 D5
E1 E2 E3 E4 E5
F1 F2 F3 F4 F5
G1 G2 G3 G4 G5

That gives us a logical disk stripe 35 wide, that we just need to
scramble. Position * Prime mod Disks.

The problem, of course, is that this is going to scramble each logical
stripe the same. So as we've got 5 stripes, can we use a table of 5
different primes? The snag, of course, is that we mustn't use a prime
that is a factor of the number of disks, as this will blow up with a
vengeance. It makes sense also to avoid a prime that's a factor of the
logical disks, I suspect. So here, I'd use 2,3,11,13,17.

I'd like to add the logical stripe number in rather than use multiple
primes, but I don't think that'll work ...

Cheers,
Wol
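
One possible reading of the scramble sketched above, written out so it
can be checked: chunk c of logical stripe s is placed on drive
(s + c * prime[s]) mod drives. This is only an interpretation of the
"Position * Prime mod Disks" idea, not a worked-out spec; the placement
formula and the stripe/chunk counts are illustrative.

#include <stdio.h>

#define DRIVES  7
#define STRIPES 7		/* logical stripes A..G */
#define CHUNKS  5		/* 3 data + 2 parity per logical stripe */

int main(void)
{
	int primes[] = { 2, 3, 11, 13, 17 };

	for (int s = 0; s < STRIPES; s++) {
		int used[DRIVES] = { 0 };

		printf("stripe %c:", 'A' + s);
		for (int c = 0; c < CHUNKS; c++) {
			int d = (s + c * primes[s % 5]) % DRIVES;

			printf(" %c%d->d%d", 'A' + s, c + 1, d);
			if (used[d]++)
				printf(" (COLLISION!)");
		}
		printf("\n");
	}
	return 0;
}

Because every multiplier is co-prime with the 7 drives, no stripe ever
puts two of its chunks on the same drive; pick a multiplier sharing a
factor with the drive count and the COLLISION warning fires immediately.
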
* Re: [LSF/MM TOPIC] De-clustered RAID with MD
From: David Brown @ 2018-01-31 9:58 UTC
To: NeilBrown, Wols Lists, Johannes Thumshirn, lsf-pc
Cc: linux-raid, linux-block, Hannes Reinecke, Neil Brown

On 29/01/18 22:50, NeilBrown wrote:
[...]
> Imagine an array with 100 drives on which we store data in sets of
> (say) 6 data chunks and 2 parity chunks.
> Each group of 8 chunks is distributed over the 100 drives in a
> different way, so that (e.g.) 600 data chunks and 200 parity chunks are
> distributed over 8 physical stripes using some clever distribution
> function.
> If (when) one drive fails, the 8 chunks in this set of 8 physical
> stripes can be recovered by reading 6*8 == 48 chunks, which will each be
> on a different drive.
>
> You would probably say that even doing raid6 on 100 drives is crazy.
> Better to make, e.g., 10 groups of 10 and do raid6 on each of the 10,
> then LVM them together.
>
> By doing declustered parity you can sanely do raid6 on 100 drives, using
> a logical stripe size that is much smaller than 100.
> When recovering a single drive, the 10-groups-of-10 would put heavy load
> on 9 other drives, while the decluster approach puts light load on 99
> other drives. No matter how clever md is at throttling recovery, I
> would still rather distribute the load so that md has an easier job.

That sounds smart. I don't see that you need anything particularly
complicated for how you distribute your data and parity drives across
the 100 disks - you just need a fairly even spread.

I would be more concerned with how you could deal with resizing such an
array. In particular, I think it is not unlikely that someone with a
100-drive array will one day want to add another bank of 24 disks (or
whatever fits in a cabinet). Making that work nicely would, I believe,
be more important than making sure the rebuild load distribution is
balanced evenly across 99 drives.

I would also be interested in how the data and parities are distributed
across cabinets and disk controllers. When you manually build from
smaller raid sets, you can ensure that in each set the data disks and
the parity are all in different cabinets - that way if an entire cabinet
goes up in smoke, you have lost one drive from each set, and your data
is still there. With a pseudo-random layout, you have lost that. (I
don't know how often entire cabinets of disks die, but I once lost both
disks of a raid1 mirror when the disk controller card died.)
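
A sketch of how a failure domain can be folded into a hash-based
placement: reject candidate drives whose cabinet already holds a chunk
of the stripe, so losing a whole cabinet costs at most one chunk per
stripe. The drive count, cabinet size and hash are all invented for the
example; it is not how CRUSH encodes its hierarchy.

#include <stdio.h>
#include <stdint.h>

#define DRIVES       96
#define PER_CABINET  12		/* 8 hypothetical cabinets */
#define CABINETS     (DRIVES / PER_CABINET)
#define CHUNKS       8		/* 6 data + 2 parity */

static uint64_t mix(uint64_t x)
{
	x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
	x ^= x >> 27; x *= 0x94d049bb133111ebULL;
	return x ^ (x >> 31);
}

int main(void)
{
	for (uint64_t s = 0; s < 3; s++) {	/* a few logical stripes */
		int cab_used[CABINETS] = { 0 };

		printf("stripe %llu:", (unsigned long long)s);
		for (int c = 0; c < CHUNKS; c++) {
			uint64_t r = 0;
			int d, cab;

			do {	/* re-hash until we hit an unused cabinet */
				d = mix(s * 982451653ULL + c * 37 + r++) % DRIVES;
				cab = d / PER_CABINET;
			} while (cab_used[cab]);
			cab_used[cab] = 1;
			printf(" d%d(cab%d)", d, cab);
		}
		printf("\n");
	}
	return 0;
}

The placement stays deterministic and pseudo-random, but every stripe
ends up with one chunk per cabinet, which restores the property the
hand-built layout gives you.
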
* Re: [LSF/MM TOPIC] De-clustered RAID with MD
From: Johannes Thumshirn @ 2018-01-31 10:58 UTC
To: David Brown
Cc: NeilBrown, Wols Lists, lsf-pc, linux-raid, linux-block, Hannes Reinecke, Neil Brown

David Brown <david.brown@hesbynett.no> writes:
> That sounds smart. I don't see that you need anything particularly
> complicated for how you distribute your data and parity drives across
> the 100 disks - you just need a fairly even spread.

Exactly.

> I would be more concerned with how you could deal with resizing such an
> array. In particular, I think it is not unlikely that someone with a
> 100-drive array will one day want to add another bank of 24 disks (or
> whatever fits in a cabinet). Making that work nicely would, I believe,
> be more important than making sure the rebuild load distribution is
> balanced evenly across 99 drives.

I don't think rebuilding is such a big deal. Let's consider the
following hypothetical scenario: 6 disks with 4 data blocks, 3 replicas
per block (these could be RAID1-like duplicates or RAID5-like data +
parity; it doesn't matter at all for this example):

 D1  D2  D3  D4  D5  D6
[A] [B] [C] [ ] [ ] [ ]
[ ] [ ] [ ] [A] [D] [B]
[ ] [A] [B] [ ] [C] [ ]
[C] [ ] [ ] [D] [ ] [D]

Now we're adding one disk and rebalance:

 D1  D2  D3  D4  D5  D6  D7
[A] [B] [C] [ ] [ ] [ ] [A]
[ ] [ ] [ ] [ ] [D] [B] [ ]
[ ] [A] [B] [ ] [ ] [ ] [C]
[C] [ ] [ ] [D] [ ] [D] [ ]

This moved the "A" from D4 and the "C" from D5 to D7. The whole
rebalancing affected only 3 disks (read from D4 and D5, write to D7).

> I would also be interested in how the data and parities are distributed
> across cabinets and disk controllers. When you manually build from
> smaller raid sets, you can ensure that in each set the data disks and
> the parity are all in different cabinets - that way if an entire cabinet
> goes up in smoke, you have lost one drive from each set, and your data
> is still there. With a pseudo-random layout, you have lost that. (I
> don't know how often entire cabinets of disks die, but I once lost both
> disks of a raid1 mirror when the disk controller card died.)

Well, this is something CRUSH takes care of. As I said earlier, it's a
weighted decision tree. One of the weights could be to evenly spread all
blocks across two cabinets. Taking this into account would require a
non-trivial user interface and I'm not sure if the benefits of this
outweigh the costs (at least for an initial implementation).

Byte,
	Johannes
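
A sketch of why adding a drive need not reshuffle everything, using
rendezvous (highest-random-weight) hashing as a stand-in placement rule:
each block goes to the drive with the highest hash(block, drive), so
adding a drive only moves the blocks the new drive now "wins". This is
not what MD or CRUSH implements; it just demonstrates the minimal-
movement property the example above relies on.

#include <stdio.h>
#include <stdint.h>

static uint64_t mix(uint64_t x)
{
	x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
	x ^= x >> 27; x *= 0x94d049bb133111ebULL;
	return x ^ (x >> 31);
}

/* place 'block' on the drive with the highest per-(block,drive) weight */
static int place(uint64_t block, int drives)
{
	int best = 0;
	uint64_t best_w = 0;

	for (int d = 0; d < drives; d++) {
		uint64_t w = mix(block * 2654435761ULL + d);
		if (w > best_w) {
			best_w = w;
			best = d;
		}
	}
	return best;
}

int main(void)
{
	const int blocks = 100000;
	int moved = 0;

	for (int b = 0; b < blocks; b++)
		if (place(b, 6) != place(b, 7))
			moved++;
	/* expect roughly blocks/7 (~14%) to move, not ~86% */
	printf("%d of %d blocks moved when going from 6 to 7 drives\n",
	       moved, blocks);
	return 0;
}

A naive "hash(block) % drives" rule, by contrast, relocates almost
everything when the drive count changes.
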
* Re: [LSF/MM TOPIC] De-clustered RAID with MD
From: Wols Lists @ 2018-01-31 14:27 UTC
To: David Brown, NeilBrown, Johannes Thumshirn, lsf-pc
Cc: linux-raid, linux-block, Hannes Reinecke, Neil Brown

On 31/01/18 09:58, David Brown wrote:
> I would also be interested in how the data and parities are distributed
> across cabinets and disk controllers. When you manually build from
> smaller raid sets, you can ensure that in each set the data disks and
> the parity are all in different cabinets - that way if an entire cabinet
> goes up in smoke, you have lost one drive from each set, and your data
> is still there. With a pseudo-random layout, you have lost that. (I
> don't know how often entire cabinets of disks die, but I once lost both
> disks of a raid1 mirror when the disk controller card died.)

The more I think about how I plan to spec raid-61, the more a modulo
approach seems to make sense. That way, it'll be fairly easy to predict
what ends up where, and make sure your disks are evenly scattered.

I think both your and my approach might have problems with losing an
entire cabinet, however. Depends on how many drives per cabinet ...

Anyways, my second thoughts are ...

We have what I will call a stripe-block: the lowest common multiple of
"disks needed" (i.e. number of mirrors times number of drives in the
raid-6) and the disks available.

Assuming my blocks are all stored sequentially, I can then quickly
calculate their position in this stripe-block. But this will fall foul
of just hammering the drives nearest to the failed drive. However, if I
pseudo-randomise this position with "position * prime mod drives", where
"prime" is not common to either the number of drives or the number of
mirrors or the number of raid-drives, then this should achieve my aim of
uniquely shuffling the location of all the blocks without collisions.

Pretty simple maths, for efficiency, that smears the data over all the
drives. Does that sound feasible? All the heavy lifting - calculating
the least common multiple, finding the prime, etc etc - can be done at
array set-up time.

(If this then allows feasible 100-drive arrays, we won't just need an
incremental assemble mode, we might need an incremental build mode :-)

Cheers,
Wol
* Re: [LSF/MM TOPIC] De-clustered RAID with MD
From: David Brown @ 2018-01-31 14:41 UTC
To: Wols Lists, NeilBrown, Johannes Thumshirn, lsf-pc
Cc: linux-raid, linux-block, Hannes Reinecke, Neil Brown

On 31/01/18 15:27, Wols Lists wrote:
> On 31/01/18 09:58, David Brown wrote:
>> I would also be interested in how the data and parities are distributed
>> across cabinets and disk controllers. [...]
>
> The more I think about how I plan to spec raid-61, the more a modulo
> approach seems to make sense. That way, it'll be fairly easy to predict
> what ends up where, and make sure your disks are evenly scattered.
>
> I think both your and my approach might have problems with losing an
> entire cabinet, however. Depends on how many drives per cabinet ...

Exactly. I don't know how many cabinets are used on such systems.

> Anyways, my second thoughts are ...
>
> We have what I will call a stripe-block: the lowest common multiple of
> "disks needed" (i.e. number of mirrors times number of drives in the
> raid-6) and the disks available.
>
> Assuming my blocks are all stored sequentially, I can then quickly
> calculate their position in this stripe-block. But this will fall foul
> of just hammering the drives nearest to the failed drive. However, if I
> pseudo-randomise this position with "position * prime mod drives", where
> "prime" is not common to either the number of drives or the number of
> mirrors or the number of raid-drives, then this should achieve my aim of
> uniquely shuffling the location of all the blocks without collisions.
>
> Pretty simple maths, for efficiency, that smears the data over all the
> drives. Does that sound feasible? All the heavy lifting - calculating
> the least common multiple, finding the prime, etc etc - can be done at
> array set-up time.

Something like that should work, and be convenient to implement. I am
not sure off the top of my head if such a simple modulo system is valid,
but it won't be difficult to check.

> (If this then allows feasible 100-drive arrays, we won't just need an
> incremental assemble mode, we might need an incremental build mode :-)

You really want to track which stripes are valid here, and which are not
yet made consistent. A blank array will start with everything marked
invalid or inconsistent - build mode is just a matter of writing the
metadata. You only need to make stripes consistent when you first write
to them.
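
The check is indeed easy to do by brute force: i -> (i * p) mod n is a
bijection on 0..n-1 exactly when gcd(p, n) == 1, so the shuffle never
collides as long as the multiplier shares no factor with the drive
count. The (drives, multiplier) pairs below are arbitrary test values.

#include <stdio.h>

static int gcd(int a, int b)
{
	while (b) {
		int t = a % b;
		a = b;
		b = t;
	}
	return a;
}

static int is_permutation(int n, int p)
{
	int seen[4096] = { 0 };		/* assumes n <= 4096 */

	for (int i = 0; i < n; i++)
		if (seen[(i * p) % n]++)
			return 0;	/* collision */
	return 1;
}

int main(void)
{
	/* a few (drives, multiplier) pairs, including deliberately bad ones */
	int cases[][2] = { { 100, 7 }, { 100, 5 }, { 18, 5 }, { 18, 3 } };

	for (unsigned i = 0; i < sizeof(cases) / sizeof(cases[0]); i++) {
		int n = cases[i][0], p = cases[i][1];

		printf("n=%3d p=%2d gcd=%d -> %s\n", n, p, gcd(n, p),
		       is_permutation(n, p) ? "permutation" : "collides");
	}
	return 0;
}
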
* Re: [LSF/MM TOPIC] De-clustered RAID with MD
From: Johannes Thumshirn @ 2018-01-30 9:40 UTC
To: Wols Lists
Cc: lsf-pc, linux-raid, linux-block, Hannes Reinecke, Neil Brown

Wols Lists <antlists@youngman.org.uk> writes:
> On 29/01/18 15:23, Johannes Thumshirn wrote:
>> With the rise of bigger and bigger disks, array rebuilding times start
>> skyrocketing.
>
> And? Yes, your data is at risk during a rebuild, but md-raid throttles
> the i/o, so it doesn't hammer the system.
>
>> This has at least two advantages:
>> 1) If one disk has to be replaced, the data does not need to be read from
>>    all disks to recover the failed disk, so non-affected disks can be
>>    used for real user I/O and not just recovery, and
>
> Again, that's throttling, so that's not a problem ...

And throttling in a production environment is not exactly desired.
Imagine a 500-disk array (and yes, this is something we've seen with MD)
and you have to replace disks. While the array is rebuilt you have to
throttle all I/O, because with raid-{1,5,6,10} all data is striped
across all disks.

With a parity-declustered RAID (or DDP, as Dell, NetApp or Huawei call
it) you don't have to, as the I/O is replicated in parity groups across
a subset of disks. All I/O targeting disks which aren't needed to
recover the data from the failed disks isn't affected by the throttling
at all.

>> This whole declustering of course only makes sense for more than (at
>> least) 4 drives, but we do have customers with several orders of
>> magnitude more drives in an MD array.
>
> If you have four drives or more - especially if they are multi-terabyte
> drives - you should NOT be using raid-5 ...

raid-6 won't help you much in the above scenario.

[...]
> Okay - I've now skimmed the CRUSH paper [2]. Looks well interesting.
> BUT. It feels more like btrfs than it does like raid.
>
> Btrfs manages disks, and does raid; it tries to be the "everything
> between the hard drive and the file". This CRUSH thingy reads to me like
> it wants to be the same. There's nothing wrong with that, but md is a
> unix-y "do one thing (raid) and do it well".

Well, CRUSH is (one of) the algorithms behind Ceph. It takes the
decisions where to place a block. It is just a hash (well, technically a
weighted decision tree) function that takes a block of I/O and some
configuration parameters and "calculates" the placement.

> My knee-jerk reaction is if you want to go for it, it sounds like a good
> idea. It just doesn't really feel a good fit for md.

Thanks for the input.

	Johannes
* Re: [LSF/MM TOPIC] De-clustered RAID with MD
From: David Brown @ 2018-01-31 8:03 UTC
To: Johannes Thumshirn, Wols Lists
Cc: lsf-pc, linux-raid, linux-block, Hannes Reinecke, Neil Brown

On 30/01/18 10:40, Johannes Thumshirn wrote:
[...]
> And throttling in a production environment is not exactly desired.
> Imagine a 500-disk array (and yes, this is something we've seen with MD)
> and you have to replace disks. While the array is rebuilt you have to
> throttle all I/O, because with raid-{1,5,6,10} all data is striped
> across all disks.

You definitely don't want a stripe across 500 disks! I'd be inclined to
have raid1 pairs as the basic block, or perhaps 6-8 drive raid6 if you
want higher space efficiency. Then you build your full array on top of
that, along with a file system that can take advantage of the layout.
If you have an XFS over a linear concat of these sets, then you have a
system that can quickly serve many parallel loads - but that could give
poor distribution if you are storing massive streaming data. And
rebuilds only delay data from the one block that is involved in the
rebuild. (I have no experience with anything bigger than about 6 disks -
this is just theory on my part.)

> With a parity-declustered RAID (or DDP, as Dell, NetApp or Huawei call
> it) you don't have to, as the I/O is replicated in parity groups across
> a subset of disks. All I/O targeting disks which aren't needed to
> recover the data from the failed disks isn't affected by the throttling
> at all.
>
>> If you have four drives or more - especially if they are multi-terabyte
>> drives - you should NOT be using raid-5 ...
>
> raid-6 won't help you much in the above scenario.

Raid-6 is still a great deal better than raid-5 :-) And for your
declustered raid or distributed parity, you can have two parities rather
than just one.