* [LSF/MM TOPIC] De-clustered RAID with MD
@ 2018-01-29 15:23 Johannes Thumshirn
2018-01-29 16:32 ` Wols Lists
0 siblings, 1 reply; 14+ messages in thread
From: Johannes Thumshirn @ 2018-01-29 15:23 UTC (permalink / raw)
To: lsf-pc; +Cc: linux-raid, linux-block, Hannes Reinecke, Neil Brown
Hi linux-raid, lsf-pc
(If you've received this mail multiple times, I'm sorry, I'm having
trouble with the mail setup).
With the rise of bigger and bigger disks, array rebuilding times start
skyrocketing.
In a paper from '92, Holland and Gibson [1] suggest a mapping algorithm
similar to RAID5, but instead of utilizing all disks in an array for
every I/O operation, it implements a per-I/O mapping function to use
only a subset of the available disks.
This has at least two advantages:
1) If one disk has to be replaced, the data doesn't need to be read from
all disks to recover the one failed disk, so non-affected disks can be
used for real user I/O and not just recovery, and
2) an efficient mapping function can improve parallel I/O submission, as
two different I/Os are not necessarily going to the same disks in the
array.
For the mapping function, a hashing algorithm like Ceph's CRUSH [2]
would be ideal, as it provides a pseudo-random but deterministic mapping
of the I/O onto the drives.
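To illustrate the property being asked for, here is a small sketch (not CRUSH itself; the function name and seeding scheme are made up for illustration): the placement of a logical stripe is computed from a hash, so it looks random yet any node can recompute it from the stripe number and array configuration alone, with no lookup table.

```python
import hashlib

def map_stripe(stripe_no, n_disks, stripe_width, seed=b"array-uuid"):
    """Deterministically pick a pseudo-random subset of disks for one
    logical stripe: random-looking, but reproducible from the inputs."""
    candidates = list(range(n_disks))
    chosen = []
    for slot in range(stripe_width):
        h = hashlib.sha256(
            seed + stripe_no.to_bytes(8, "little") + slot.to_bytes(2, "little")
        ).digest()
        # sample without replacement so all chunks land on distinct disks
        chosen.append(candidates.pop(int.from_bytes(h[:8], "little") % len(candidates)))
    return chosen

# Same inputs, same placement -- no placement metadata needs storing:
assert map_stripe(42, 100, 8) == map_stripe(42, 100, 8)
```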
This whole declustering of course only makes sense for more than (at
least) 4 drives, but we do have customers with several orders of
magnitude more drives in an MD array.
At LSF I'd like to discuss if:
1) The wider MD audience is interested in de-clustered RAID with MD
2) de-clustered RAID should be implemented as a sublevel of RAID5 or
as a new personality
3) CRUSH is a suitable algorithm for this (there's evidence in [3] that
the NetApp E-Series Arrays do use CRUSH for parity declustering)
[1] http://www.pdl.cmu.edu/PDL-FTP/Declustering/ASPLOS.pdf
[2] https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf
[3]
https://www.snia.org/sites/default/files/files2/files2/SDC2013/presentations/DistributedStorage/Jibbe-Gwaltney_Method-to_Establish_High_Availability.pdf
Thanks,
Johannes
--
Johannes Thumshirn Storage
jthumshirn@suse.de +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [LSF/MM TOPIC] De-clustered RAID with MD
2018-01-29 15:23 [LSF/MM TOPIC] De-clustered RAID with MD Johannes Thumshirn
@ 2018-01-29 16:32 ` Wols Lists
2018-01-29 21:50 ` [Lsf-pc] " NeilBrown
2018-01-30 9:40 ` [Lsf-pc] " Johannes Thumshirn
0 siblings, 2 replies; 14+ messages in thread
From: Wols Lists @ 2018-01-29 16:32 UTC (permalink / raw)
To: Johannes Thumshirn, lsf-pc
Cc: linux-raid, linux-block, Hannes Reinecke, Neil Brown
On 29/01/18 15:23, Johannes Thumshirn wrote:
> Hi linux-raid, lsf-pc
>
> (If you've received this mail multiple times, I'm sorry, I'm having
> trouble with the mail setup).
My immediate reactions as a lay person (I edit the raid wiki) ...
>
> With the rise of bigger and bigger disks, array rebuilding times start
> skyrocketing.
And? Yes, your data is at risk during a rebuild, but md-raid throttles
the i/o, so it doesn't hammer the system.
>
> In a paper form '92 Holland and Gibson [1] suggest a mapping algorithm
> similar to RAID5 but instead of utilizing all disks in an array for
> every I/O operation, but implement a per-I/O mapping function to only
> use a subset of the available disks.
>
> This has at least two advantages:
> 1) If one disk has to be replaced, it's not needed to read the data from
> all disks to recover the one failed disk so non-affected disks can be
> used for real user I/O and not just recovery and
Again, that's throttling, so that's not a problem ...
> 2) an efficient mapping function can improve parallel I/O submission, as
> two different I/Os are not necessarily going to the same disks in the
> array.
>
> For the mapping function used a hashing algorithm like Ceph's CRUSH [2]
> would be ideal, as it provides a pseudo random but deterministic mapping
> for the I/O onto the drives.
>
> This whole declustering of cause only makes sense for more than (at
> least) 4 drives but we do have customers with several orders of
> magnitude more drivers in an MD array.
If you have four drives or more - especially if they are multi-terabyte
drives - you should NOT be using raid-5 ...
>
> At LSF I'd like to discuss if:
> 1) The wider MD audience is interested in de-clusterd RAID with MD
I haven't read the papers, so no comment, sorry.
> 2) de-clustered RAID should be implemented as a sublevel of RAID5 or
> as a new personality
Neither! If you're going to do it, it should be raid-6.
> 3) CRUSH is a suitible algorith for this (there's evidence in [3] that
> the NetApp E-Series Arrays do use CRUSH for parity declustering)
>
> [1] http://www.pdl.cmu.edu/PDL-FTP/Declustering/ASPLOS.pdf
> [2] https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf
> [3]
> https://www.snia.org/sites/default/files/files2/files2/SDC2013/presentations/DistributedStorage/Jibbe-Gwaltney_Method-to_Establish_High_Availability.pdf
>
Okay - I've now skimmed the crush paper [2]. Looks well interesting.
BUT. It feels more like btrfs than it does like raid.
Btrfs manages disks, and does raid, it tries to be the "everything
between the hard drive and the file". This crush thingy reads to me like
it wants to be the same. There's nothing wrong with that, but md is a
unix-y "do one thing (raid) and do it well".
My knee-jerk reaction is if you want to go for it, it sounds like a good
idea. It just doesn't really feel a good fit for md.
Cheers,
Wol
* Re: [Lsf-pc] [LSF/MM TOPIC] De-clustered RAID with MD
2018-01-29 16:32 ` Wols Lists
@ 2018-01-29 21:50 ` NeilBrown
2018-01-30 10:43 ` Wols Lists
2018-01-31 9:58 ` [Lsf-pc] " David Brown
2018-01-30 9:40 ` [Lsf-pc] " Johannes Thumshirn
1 sibling, 2 replies; 14+ messages in thread
From: NeilBrown @ 2018-01-29 21:50 UTC (permalink / raw)
To: Wols Lists, Johannes Thumshirn, lsf-pc
Cc: linux-raid, linux-block, Neil Brown, Hannes Reinecke
On Mon, Jan 29 2018, Wols Lists wrote:
> On 29/01/18 15:23, Johannes Thumshirn wrote:
>> Hi linux-raid, lsf-pc
>>
>> (If you've received this mail multiple times, I'm sorry, I'm having
>> trouble with the mail setup).
>
> My immediate reactions as a lay person (I edit the raid wiki) ...
>>
>> With the rise of bigger and bigger disks, array rebuilding times start
>> skyrocketing.
>
> And? Yes, your data is at risk during a rebuild, but md-raid throttles
> the i/o, so it doesn't hammer the system.
>>
>> In a paper form '92 Holland and Gibson [1] suggest a mapping algorithm
>> similar to RAID5 but instead of utilizing all disks in an array for
>> every I/O operation, but implement a per-I/O mapping function to only
>> use a subset of the available disks.
>>
>> This has at least two advantages:
>> 1) If one disk has to be replaced, it's not needed to read the data from
>> all disks to recover the one failed disk so non-affected disks can be
>> used for real user I/O and not just recovery and
>
> Again, that's throttling, so that's not a problem ...
Imagine an array with 100 drives on which we store data in sets of
(say) 6 data chunks and 2 parity chunks.
Each group of 8 chunks is distributed over the 100 drives in a
different way so that (e.g) 600 data chunks and 200 parity chunks are
distributed over 8 physical stripes using some clever distribution
function.
If (when) one drive fails, the 8 chunks in this set of 8 physical
stripes can be recovered by reading 6*8 == 48 chunks which will each be
on a different drive. Half the drives deliver only one chunk (in an ideal
distribution) and the other half deliver none. Maybe they will deliver
some for the next set of 100 logical stripes.
You would probably say that even doing raid6 on 100 drives is crazy.
Better to make, e.g. 10 groups of 10 and do raid6 on each of the 10,
then LVM them together.
By doing declustered parity you can sanely do raid6 on 100 drives, using
a logical stripe size that is much smaller than 100.
When recovering a single drive, the 10-groups-of-10 would put heavy load
on 9 other drives, while the decluster approach puts light load on 99
other drives. No matter how clever md is at throttling recovery, I
would still rather distribute the load so that md has an easier job.
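Neil's arithmetic can be checked with a small simulation (a sketch; it assumes each 6+2 logical stripe lands on a uniformly random set of 8 disks, as a stand-in for the real "clever distribution function"). In the 10-groups-of-10 alternative, the same total reads would all fall on just 9 drives.

```python
import random
from collections import Counter

def rebuild_load(n_disks=100, width=8, data=6, n_stripes=5000, failed=0, seed=7):
    """Count the chunks each surviving disk must serve to rebuild one
    failed disk. For every stripe that touched the failed disk, any
    `data` of the remaining width-1 chunks suffice to reconstruct it."""
    rng = random.Random(seed)
    load = Counter()
    for _ in range(n_stripes):
        stripe = rng.sample(range(n_disks), width)
        if failed in stripe:
            survivors = [d for d in stripe if d != failed]
            for d in rng.sample(survivors, data):  # read 6 of the 7 survivors
                load[d] += 1
    return load

load = rebuild_load()
# Declustered: the rebuild reads spread thinly over essentially all 99
# surviving disks instead of hammering the 9 partners of a fixed group.
```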
NeilBrown
>
>> 2) an efficient mapping function can improve parallel I/O submission, as
>> two different I/Os are not necessarily going to the same disks in the
>> array.
>>
>> For the mapping function used a hashing algorithm like Ceph's CRUSH [2]
>> would be ideal, as it provides a pseudo random but deterministic mapping
>> for the I/O onto the drives.
>>
>> This whole declustering of cause only makes sense for more than (at
>> least) 4 drives but we do have customers with several orders of
>> magnitude more drivers in an MD array.
>
> If you have four drives or more - especially if they are multi-terabyte
> drives - you should NOT be using raid-5 ...
>>
>> At LSF I'd like to discuss if:
>> 1) The wider MD audience is interested in de-clusterd RAID with MD
>
> I haven't read the papers, so no comment, sorry.
>
>> 2) de-clustered RAID should be implemented as a sublevel of RAID5 or
>> as a new personality
>
> Neither! If you're going to do it, it should be raid-6.
>
>> 3) CRUSH is a suitible algorith for this (there's evidence in [3] that
>> the NetApp E-Series Arrays do use CRUSH for parity declustering)
>>
>> [1] http://www.pdl.cmu.edu/PDL-FTP/Declustering/ASPLOS.pdf
>> [2] https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf
>> [3]
>> https://www.snia.org/sites/default/files/files2/files2/SDC2013/presentations/DistributedStorage/Jibbe-Gwaltney_Method-to_Establish_High_Availability.pdf
>>
> Okay - I've now skimmed the crush paper [2]. Looks well interesting.
> BUT. It feels more like btrfs than it does like raid.
>
> Btrfs manages disks, and does raid, it tries to be the "everything
> between the hard drive and the file". This crush thingy reads to me like
> it wants to be the same. There's nothing wrong with that, but md is a
> unix-y "do one thing (raid) and do it well".
>
> My knee-jerk reaction is if you want to go for it, it sounds like a good
> idea. It just doesn't really feel a good fit for md.
>
> Cheers,
> Wol
_______________________________________________
Lsf-pc mailing list
Lsf-pc@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/lsf-pc
* Re: [Lsf-pc] [LSF/MM TOPIC] De-clustered RAID with MD
2018-01-29 16:32 ` Wols Lists
2018-01-29 21:50 ` [Lsf-pc] " NeilBrown
@ 2018-01-30 9:40 ` Johannes Thumshirn
2018-01-31 8:03 ` David Brown
1 sibling, 1 reply; 14+ messages in thread
From: Johannes Thumshirn @ 2018-01-30 9:40 UTC (permalink / raw)
To: Wols Lists; +Cc: linux-raid, linux-block, Neil Brown, lsf-pc, Hannes Reinecke
Wols Lists <antlists@youngman.org.uk> writes:

> On 29/01/18 15:23, Johannes Thumshirn wrote:
>> Hi linux-raid, lsf-pc
>>
>> (If you've received this mail multiple times, I'm sorry, I'm having
>> trouble with the mail setup).
>
> My immediate reactions as a lay person (I edit the raid wiki) ...
>>
>> With the rise of bigger and bigger disks, array rebuilding times start
>> skyrocketing.
>
> And? Yes, your data is at risk during a rebuild, but md-raid throttles
> the i/o, so it doesn't hammer the system.
>>
>> In a paper form '92 Holland and Gibson [1] suggest a mapping algorithm
>> similar to RAID5 but instead of utilizing all disks in an array for
>> every I/O operation, but implement a per-I/O mapping function to only
>> use a subset of the available disks.
>>
>> This has at least two advantages:
>> 1) If one disk has to be replaced, it's not needed to read the data from
>>    all disks to recover the one failed disk so non-affected disks can be
>>    used for real user I/O and not just recovery and
>
> Again, that's throttling, so that's not a problem ...

And throttling in a production environment is not exactly
desired. Imagine a 500 disk array (and yes this is something we've seen
with MD) and you have to replace disks. While the array is rebuilt you
have to throttle all I/O because with raid-{1,5,6,10} all data is
striped across all disks.

With a parity declustered RAID (or DDP like Dell, NetApp or Huawei call
it) you don't have to as the I/O is replicated in parity groups across a
subset of disks. All I/O targeting disks which aren't needed to recover
the data from the failed disks aren't affected by the throttling at all.

>> 2) an efficient mapping function can improve parallel I/O submission, as
>>    two different I/Os are not necessarily going to the same disks in the
>>    array.
>>
>> For the mapping function used a hashing algorithm like Ceph's CRUSH [2]
>> would be ideal, as it provides a pseudo random but deterministic mapping
>> for the I/O onto the drives.
>>
>> This whole declustering of cause only makes sense for more than (at
>> least) 4 drives but we do have customers with several orders of
>> magnitude more drivers in an MD array.
>
> If you have four drives or more - especially if they are multi-terabyte
> drives - you should NOT be using raid-5 ...

raid-6 won't help you much in above scenario.

>>
>> At LSF I'd like to discuss if:
>> 1) The wider MD audience is interested in de-clusterd RAID with MD
>
> I haven't read the papers, so no comment, sorry.
>
>> 2) de-clustered RAID should be implemented as a sublevel of RAID5 or
>>    as a new personality
>
> Neither! If you're going to do it, it should be raid-6.
>
>> 3) CRUSH is a suitible algorith for this (there's evidence in [3] that
>>    the NetApp E-Series Arrays do use CRUSH for parity declustering)
>>
>> [1] http://www.pdl.cmu.edu/PDL-FTP/Declustering/ASPLOS.pdf
>> [2] https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf
>> [3]
>> https://www.snia.org/sites/default/files/files2/files2/SDC2013/presentations/DistributedStorage/Jibbe-Gwaltney_Method-to_Establish_High_Availability.pdf
>>
> Okay - I've now skimmed the crush paper [2]. Looks well interesting.
> BUT. It feels more like btrfs than it does like raid.
>
> Btrfs manages disks, and does raid, it tries to be the "everything
> between the hard drive and the file". This crush thingy reads to me like
> it wants to be the same. There's nothing wrong with that, but md is a
> unix-y "do one thing (raid) and do it well".

Well CRUSH is (one of) the algorithms behind Ceph. It takes the
decisions where to place a block. It is just a hash (well technically a
weighted decision-tree) function that takes a block of I/O and a some
configuration parameters and "calculates" the placement.

> My knee-jerk reaction is if you want to go for it, it sounds like a good
> idea. It just doesn't really feel a good fit for md.

Thanks for the input.

       Johannes
--
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
* Re: [LSF/MM TOPIC] De-clustered RAID with MD
2018-01-29 21:50 ` [Lsf-pc] " NeilBrown
@ 2018-01-30 10:43 ` Wols Lists
2018-01-30 11:24 ` NeilBrown
2018-01-31 9:58 ` [Lsf-pc] " David Brown
1 sibling, 1 reply; 14+ messages in thread
From: Wols Lists @ 2018-01-30 10:43 UTC (permalink / raw)
To: NeilBrown, Johannes Thumshirn, lsf-pc
Cc: linux-raid, linux-block, Hannes Reinecke, Neil Brown
On 29/01/18 21:50, NeilBrown wrote:
> By doing declustered parity you can sanely do raid6 on 100 drives, using
> a logical stripe size that is much smaller than 100.
> When recovering a single drive, the 10-groups-of-10 would put heavy load
> on 9 other drives, while the decluster approach puts light load on 99
> other drives. No matter how clever md is at throttling recovery, I
> would still rather distribute the load so that md has an easier job.
Not offering to do it ... :-)
But that sounds a bit like linux raid-10. Could a simple approach be to
do something like "raid-6,11,100", ie raid-6 with 9 data chunks, two
parity, striped across 100 drives? Okay, it's not as good as the
decluster approach, but it would spread the stress of a rebuild across
20 drives, not 10. And probably be fairly easy to implement.
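Reading "raid-6,11,100" literally as 11-chunk stripes laid out round-robin over 100 disks (a toy sketch; the helper name is made up), the "20 drives" estimate can be checked directly:

```python
def stripe_disks(stripe_no, n_disks=100, width=11):
    """11-chunk stripes (9 data + 2 parity) laid out round-robin: each
    stripe starts where the previous one ended, wrapping mod n_disks."""
    start = (stripe_no * width) % n_disks
    return [(start + i) % n_disks for i in range(width)]

# Disks that share a stripe with disk 0, i.e. take rebuild reads if it fails.
partners = set()
for k in range(100):  # gcd(11, 100) == 1, so the layout repeats after 100 stripes
    s = stripe_disks(k)
    if 0 in s:
        partners.update(d for d in s if d != 0)

# partners is the 10 disks on either side of disk 0: len(partners) == 20
```

Each disk's rebuild partners are exactly the 10 neighbours on either side, so the rebuild stress does spread across 20 drives rather than a fixed group of 10.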
Cheers,
Wol
* Re: [LSF/MM TOPIC] De-clustered RAID with MD
2018-01-30 10:43 ` Wols Lists
@ 2018-01-30 11:24 ` NeilBrown
2018-01-30 17:40 ` Wol's lists
` (2 more replies)
0 siblings, 3 replies; 14+ messages in thread
From: NeilBrown @ 2018-01-30 11:24 UTC (permalink / raw)
To: Wols Lists, Johannes Thumshirn, lsf-pc
Cc: linux-raid, linux-block, Hannes Reinecke, Neil Brown
On Tue, Jan 30 2018, Wols Lists wrote:
> On 29/01/18 21:50, NeilBrown wrote:
>> By doing declustered parity you can sanely do raid6 on 100 drives, using
>> a logical stripe size that is much smaller than 100.
>> When recovering a single drive, the 10-groups-of-10 would put heavy load
>> on 9 other drives, while the decluster approach puts light load on 99
>> other drives. No matter how clever md is at throttling recovery, I
>> would still rather distribute the load so that md has an easier job.
>
> Not offering to do it ... :-)
>
> But that sounds a bit like linux raid-10. Could a simple approach be to
> do something like "raid-6,11,100", ie raid-6 with 9 data chunks, two
> parity, striped across 100 drives? Okay, it's not as good as the
> decluster approach, but it would spread the stress of a rebuild across
> 20 drives, not 10. And probably be fairly easy to implement.
If you did that, I think you would be about 80% of the way to fully
declustered-parity RAID.
If you then tweak the math a bit so that one stripe was
A1 A2 A3 A4 B1 B2 B3 B4 C1 C2 C3 C4 ....
and the next
A1 C1 A2 C2 A3 C3 A4 C4 B1 D1 B2 D2 ....
and then
A1 B1 C1 D1 A2 B2 C2 D2 A3 B3 C3 D3 ....
XX
When Ax are a logical stripe and Bx are the next, then you have a
slightly better distribution. If device XX fails then the reads needed
for the first stripe mostly come from different drives than those for
the second stripe, which are mostly different again for the 3rd stripe.
Presumably the CRUSH algorithm (which I only skim-read once about a year
ago) formalizes how to do this, and does it better.
Once you have the data handling in place for your proposal, it should be
little more than replacing a couple of calculations to get the full
solution.
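The effect of the rotation can be seen directly from the three rows above (a small sketch using those layouts verbatim; `partners` is a made-up helper name):

```python
row0 = "A1 A2 A3 A4 B1 B2 B3 B4 C1 C2 C3 C4".split()
row1 = "A1 C1 A2 C2 A3 C3 A4 C4 B1 D1 B2 D2".split()
row2 = "A1 B1 C1 D1 A2 B2 C2 D2 A3 B3 C3 D3".split()

def partners(row, failed_col):
    """Columns (disks) holding the other chunks of the logical stripe the
    failed disk carried in this physical row -- the disks that take the
    rebuild reads for that row."""
    stripe = row[failed_col][0]  # stripe letter, e.g. 'A'
    return {i for i, c in enumerate(row) if c[0] == stripe and i != failed_col}

# If the disk in column 2 fails, each row pulls its rebuild reads from a
# largely different set of surviving disks:
p0, p1, p2 = (partners(r, 2) for r in (row0, row1, row2))
# p0 = {0, 1, 3}, p1 = {0, 4, 6}, p2 = {6, 10}
```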
NeilBrown
* Re: [LSF/MM TOPIC] De-clustered RAID with MD
2018-01-30 11:24 ` NeilBrown
@ 2018-01-30 17:40 ` Wol's lists
2018-02-03 15:53 ` Wols Lists
2018-02-03 17:16 ` Wols Lists
2 siblings, 0 replies; 14+ messages in thread
From: Wol's lists @ 2018-01-30 17:40 UTC (permalink / raw)
To: NeilBrown, Johannes Thumshirn, lsf-pc
Cc: linux-raid, linux-block, Hannes Reinecke, Neil Brown
On 30/01/18 11:24, NeilBrown wrote:
> On Tue, Jan 30 2018, Wols Lists wrote:
>
>> On 29/01/18 21:50, NeilBrown wrote:
>>> By doing declustered parity you can sanely do raid6 on 100 drives, using
>>> a logical stripe size that is much smaller than 100.
>>> When recovering a single drive, the 10-groups-of-10 would put heavy load
>>> on 9 other drives, while the decluster approach puts light load on 99
>>> other drives. No matter how clever md is at throttling recovery, I
>>> would still rather distribute the load so that md has an easier job.
>>
>> Not offering to do it ... :-)
>>
>> But that sounds a bit like linux raid-10. Could a simple approach be to
>> do something like "raid-6,11,100", ie raid-6 with 9 data chunks, two
>> parity, striped across 100 drives? Okay, it's not as good as the
>> decluster approach, but it would spread the stress of a rebuild across
>> 20 drives, not 10. And probably be fairly easy to implement.
>
> If you did that, I think you would be about 80% of the way to fully
> declustered-parity RAID.
> If you then tweak the math a bit so that one stripe would was
>
> A1 A2 A3 A4 B1 B2 B3 B4 C1 C2 C3 C4 ....
>
> and the next
>
> A1 C1 A2 C2 A3 C3 A4 C4 B1 D1 B2 D2 ....
>
> and then
>
> A1 B1 C1 D1 A2 B2 C2 D2 A3 B3 C3 D3 ....
>
> XX
>
> When Ax are a logical stripe and Bx are the next, then you have a
> slightly better distribution. If device XX fails then the reads needed
> for the first stripe mostly come from different drives than those for
> the second stripe, which are mostly different again for the 3rd stripe.
>
> Presumably the CRUSH algorithm (which I only skim-read once about a year
> ago) formalizes how to do this, and does it better.
> Once you have the data handling in place for your proposal, it should be
> little more than replacing a couple of calculations to get the full
> solution.
>
Okay. I think I have it - a definition for raid-16 (or is it raid-61).
But I need a bit of help with the maths. And it might need a look-up
table :-(
Okay. Like raid-10, raid-16 would be spec'd as "--level 16,3,8", ie 3
mirrors, emulating an 8-drive raid-6.
What I'm looking for is a PRNG that has the "bug" that it repeats over a
short period, and over that period is guaranteed to repeat each number
the same number of times. I saw a wonderful video demonstration of this
years ago - if you plot the generated number against the number of times
it was generated, after a few rows it "filled up" a rectangle on the graph.
At which point the maths becomes very simple. I just need at least as
many real drives as "mirrors times emulated".
If somebody can come up with the algorithm I want, I can spec it, and
then maybe someone can implement it? It'll be fun testing - I'll need my
new machine when I get it working :-)
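One way to get the "filled rectangle" property being asked for (a hypothetical sketch; `balanced_prng` and its seeding are made up here) is to shuffle a balanced multiset once per period: within each period every value is emitted exactly the same number of times, in a pseudo-random but seed-reproducible order.

```python
import random

def balanced_prng(n_values, repeats, seed=0):
    """Yield pseudo-random values in [0, n_values). Within each period of
    n_values * repeats outputs, every value occurs exactly `repeats`
    times; the order is reproducible from the seed."""
    period = 0
    while True:
        block = list(range(n_values)) * repeats
        # reseed per period so any period can be regenerated independently
        random.Random(seed * 1_000_003 + period).shuffle(block)
        yield from block
        period += 1
```

Used as the slot-to-drive mapping, the per-period guarantee means each emulated slot lands on every real drive an equal number of times over one period.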
Cheers,
Wol
* Re: [LSF/MM TOPIC] De-clustered RAID with MD
2018-01-30 9:40 ` [Lsf-pc] " Johannes Thumshirn
@ 2018-01-31 8:03 ` David Brown
0 siblings, 0 replies; 14+ messages in thread
From: David Brown @ 2018-01-31 8:03 UTC (permalink / raw)
To: Johannes Thumshirn, Wols Lists
Cc: lsf-pc, linux-raid, linux-block, Hannes Reinecke, Neil Brown
On 30/01/18 10:40, Johannes Thumshirn wrote:
> Wols Lists <antlists@youngman.org.uk> writes:
>
>> On 29/01/18 15:23, Johannes Thumshirn wrote:
>>> Hi linux-raid, lsf-pc
>>>
>>> (If you've received this mail multiple times, I'm sorry, I'm having
>>> trouble with the mail setup).
>>
>> My immediate reactions as a lay person (I edit the raid wiki) ...
>>>
>>> With the rise of bigger and bigger disks, array rebuilding times start
>>> skyrocketing.
>>
>> And? Yes, your data is at risk during a rebuild, but md-raid throttles
>> the i/o, so it doesn't hammer the system.
>>>
>>> In a paper form '92 Holland and Gibson [1] suggest a mapping algorithm
>>> similar to RAID5 but instead of utilizing all disks in an array for
>>> every I/O operation, but implement a per-I/O mapping function to only
>>> use a subset of the available disks.
>>>
>>> This has at least two advantages:
>>> 1) If one disk has to be replaced, it's not needed to read the data from
>>> all disks to recover the one failed disk so non-affected disks can be
>>> used for real user I/O and not just recovery and
>>
>> Again, that's throttling, so that's not a problem ...
>
> And throttling in a production environment is not exactly
> desired. Imagine a 500 disk array (and yes this is something we've seen
> with MD) and you have to replace disks. While the array is rebuilt you
> have to throttle all I/O because with raid-{1,5,6,10} all data is
> striped across all disks.
You definitely don't want a stripe across 500 disks! I'd be inclined to
have raid1 pairs as the basic block, or perhaps 6-8 drive raid6 if you
want higher space efficiency. Then you build your full array on top of
that, along with a file system that can take advantage of the layout.
If you have an XFS over a linear concat of these sets, then you have a
system that can quickly serve many parallel loads - but that could be
poor distribution if you are storing massive streaming data. And
rebuilds only delay data from the one block that is involved in the rebuild.
(I have no experience with anything bigger than about 6 disks - this is
just theory on my part.)
>
> With a parity declustered RAID (or DDP like Dell, NetApp or Huawei call
> it) you don't have to as the I/O is replicated in parity groups across a
> subset of disks. All I/O targeting disks which aren't needed to recover
> the data from the failed disks aren't affected by the throttling at all.
>
>>> 2) an efficient mapping function can improve parallel I/O submission, as
>>> two different I/Os are not necessarily going to the same disks in the
>>> array.
>>>
>>> For the mapping function used a hashing algorithm like Ceph's CRUSH [2]
>>> would be ideal, as it provides a pseudo random but deterministic mapping
>>> for the I/O onto the drives.
>>>
>>> This whole declustering of cause only makes sense for more than (at
>>> least) 4 drives but we do have customers with several orders of
>>> magnitude more drivers in an MD array.
>>
>> If you have four drives or more - especially if they are multi-terabyte
>> drives - you should NOT be using raid-5 ...
>
> raid-6 won't help you much in above scenario.
>
Raid-6 is still a great deal better than raid-5 :-)
And for your declustered raid or distributed parity, you can have two
parities rather than just one.
* Re: [Lsf-pc] [LSF/MM TOPIC] De-clustered RAID with MD
2018-01-29 21:50 ` [Lsf-pc] " NeilBrown
2018-01-30 10:43 ` Wols Lists
@ 2018-01-31 9:58 ` David Brown
2018-01-31 10:58 ` Johannes Thumshirn
2018-01-31 14:27 ` Wols Lists
1 sibling, 2 replies; 14+ messages in thread
From: David Brown @ 2018-01-31 9:58 UTC (permalink / raw)
To: NeilBrown, Wols Lists, Johannes Thumshirn, lsf-pc
Cc: linux-raid, linux-block, Neil Brown, Hannes Reinecke
On 29/01/18 22:50, NeilBrown wrote:
> On Mon, Jan 29 2018, Wols Lists wrote:
>
>> On 29/01/18 15:23, Johannes Thumshirn wrote:
>>> Hi linux-raid, lsf-pc
>>>
>>> (If you've received this mail multiple times, I'm sorry, I'm having
>>> trouble with the mail setup).
>>
>> My immediate reactions as a lay person (I edit the raid wiki) ...
>>>
>>> With the rise of bigger and bigger disks, array rebuilding times start
>>> skyrocketing.
>>
>> And? Yes, your data is at risk during a rebuild, but md-raid throttles
>> the i/o, so it doesn't hammer the system.
>>>
>>> In a paper form '92 Holland and Gibson [1] suggest a mapping algorithm
>>> similar to RAID5 but instead of utilizing all disks in an array for
>>> every I/O operation, but implement a per-I/O mapping function to only
>>> use a subset of the available disks.
>>>
>>> This has at least two advantages:
>>> 1) If one disk has to be replaced, it's not needed to read the data from
>>> all disks to recover the one failed disk so non-affected disks can be
>>> used for real user I/O and not just recovery and
>>
>> Again, that's throttling, so that's not a problem ...
>
> Imagine an array with 100 drives on which we store data in sets of
> (say) 6 data chunks and 2 parity chunks.
> Each group of 8 chunks is distributed over the 100 drives in a
> different way so that (e.g) 600 data chunks and 200 parity chunks are
> distributed over 8 physical stripes using some clever distribution
> function.
> If (when) one drive fails, the 8 chunks in this set of 8 physical
> stripes can be recovered by reading 6*8 == 48 chunks which will each be
> on a different drive. Half the drives deliver only one chunk (in an ideal
> distribution) and the other half deliver none. Maybe they will deliver
> some for the next set of 100 logical stripes.
>
> You would probably say that even doing raid6 on 100 drives is crazy.
> Better to make, e.g. 10 groups of 10 and do raid6 on each of the 10,
> then LVM them together.
>
> By doing declustered parity you can sanely do raid6 on 100 drives, using
> a logical stripe size that is much smaller than 100.
> When recovering a single drive, the 10-groups-of-10 would put heavy load
> on 9 other drives, while the decluster approach puts light load on 99
> other drives. No matter how clever md is at throttling recovery, I
> would still rather distribute the load so that md has an easier job.
>
That sounds smart. I don't see that you need anything particularly
complicated for how you distribute your data and parity drives across
the 100 disks - you just need a fairly even spread.
I would be more concerned with how you could deal with resizing such an
array. In particular, I think it is not unlikely that someone with a
100 drive array will one day want to add another bank of 24 disks (or
whatever fits in a cabinet). Making that work nicely would, I believe,
be more important than making sure the rebuild load distribution is
balanced evenly across 99 drives.
I would also be interested in how the data and parities are distributed
across cabinets and disk controllers. When you manually build from
smaller raid sets, you can ensure that in each set the data disks and the
parity are all in different cabinets - that way if an entire cabinet
goes up in smoke, you have lost one drive from each set, and your data
is still there. With a pseudo random layout, you have lost that. (I
don't know how often entire cabinets of disks die, but I once lost both
disks of a raid1 mirror when the disk controller card died.)
* Re: [LSF/MM TOPIC] De-clustered RAID with MD
2018-01-31 9:58 ` [Lsf-pc] " David Brown
@ 2018-01-31 10:58 ` Johannes Thumshirn
2018-01-31 14:27 ` Wols Lists
1 sibling, 0 replies; 14+ messages in thread
From: Johannes Thumshirn @ 2018-01-31 10:58 UTC (permalink / raw)
To: David Brown
Cc: NeilBrown, Wols Lists, lsf-pc, linux-raid, linux-block,
Hannes Reinecke, Neil Brown
David Brown <david.brown@hesbynett.no> writes:
> That sounds smart. I don't see that you need anything particularly
> complicated for how you distribute your data and parity drives across
> the 100 disks - you just need a fairly even spread.
Exactly.
> I would be more concerned with how you could deal with resizing such an
> array. In particular, I think it is not unlikely that someone with a
> 100 drive array will one day want to add another bank of 24 disks (or
> whatever fits in a cabinet). Making that work nicely would, I believe,
> be more important than making sure the rebuild load distribution is
> balanced evenly across 99 drives.
I don't think rebalancing is such a big deal; let's consider the following
hypothetical scenario:
6 Disks with 4 data blocks (3 replicas per block, could be RAID1 like
duplicates or RAID5 like data + parity, doesn't matter at all for this
example)
D1 D2 D3 D4 D5 D6
[A] [B] [C] [ ] [ ] [ ]
[ ] [ ] [ ] [A] [D] [B]
[ ] [A] [B] [ ] [C] [ ]
[C] [ ] [ ] [D] [ ] [D]
Now we're adding one disk and rebalance:
D1 D2 D3 D4 D5 D6 D7
[A] [B] [C] [ ] [ ] [ ] [A]
[ ] [ ] [ ] [ ] [D] [B] [ ]
[ ] [A] [B] [ ] [ ] [ ] [C]
[C] [ ] [ ] [D] [ ] [D] [ ]
This moved the "A" from D4 and the "C" from D5 to D7. The whole
rebalancing affected only 3 disks (reads from D4 and D5, writes to D7).
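That minimal-movement property can be sketched without full CRUSH. The toy below (disk names and block counts are invented for illustration, this is neither md nor Ceph code) uses rendezvous hashing, which shares the property that adding a disk only moves the blocks that land on the new disk:

```python
import hashlib

def place(block_id, disks, replicas=3):
    """Pick `replicas` disks for a block by rendezvous (highest-random-weight)
    hashing: each disk gets a deterministic pseudo-random score per block,
    and the top-scoring disks win."""
    scored = sorted(
        disks,
        key=lambda d: hashlib.sha256(f"{block_id}:{d}".encode()).hexdigest(),
        reverse=True,
    )
    return set(scored[:replicas])

disks = [f"D{i}" for i in range(1, 7)]
before = {b: place(b, disks) for b in range(1000)}

# Add a seventh disk and re-place every block.
after = {b: place(b, disks + ["D7"]) for b in range(1000)}

# Only replicas displaced by the new disk move; roughly 1/7 of all
# placements, rather than a full reshuffle.
moved = sum(len(before[b] - after[b]) for b in before)
total = 3 * len(before)
print(f"replica placements moved: {moved}/{total}")
```

The same deterministic-score idea is what makes the mapping both pseudo-random and computable by every node without a lookup table.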
> I would also be interested in how the data and parities are distributed
> across cabinets and disk controllers. When you manually build from
> smaller raid sets, you can ensure that in each set the data disks and the
> parity are all in different cabinets - that way if an entire cabinet
> goes up in smoke, you have lost one drive from each set, and your data
> is still there. With a pseudo random layout, you have lost that. (I
> don't know how often entire cabinets of disks die, but I once lost both
> disks of a raid1 mirror when the disk controller card died.)
Well, this is something CRUSH takes care of. As I said earlier, it's a
weighted decision tree. One of the weights could be to evenly spread all
blocks across two cabinets.
Taking this into account would require a non-trivial user interface and
I'm not sure if the benefits of this outweigh the costs (at least for
an initial implementation).
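As a toy illustration of that idea (this is not CRUSH itself, and the `cabinet_of` topology is invented), a placement rule can rank disks pseudo-randomly but refuse to put two replicas of one block into the same failure domain:

```python
import hashlib

# Hypothetical topology: 6 disks split across 2 cabinets.
cabinet_of = {"D1": "A", "D2": "A", "D3": "A",
              "D4": "B", "D5": "B", "D6": "B"}

def score(block_id, disk):
    """Deterministic pseudo-random score for a (block, disk) pair."""
    return hashlib.sha256(f"{block_id}:{disk}".encode()).hexdigest()

def place(block_id, replicas=2):
    """Greedy two-level choice: rank disks pseudo-randomly, but never put
    two replicas of the same block into one cabinet (the failure domain)."""
    chosen, used_cabinets = [], set()
    for d in sorted(cabinet_of, key=lambda d: score(block_id, d), reverse=True):
        if cabinet_of[d] not in used_cabinets:
            chosen.append(d)
            used_cabinets.add(cabinet_of[d])
        if len(chosen) == replicas:
            break
    return chosen

# Every block ends up with one replica per cabinet, so losing a whole
# cabinet still leaves one copy of everything.
for b in range(100):
    assert {cabinet_of[d] for d in place(b)} == {"A", "B"}
```

CRUSH generalises this to arbitrary tree depths (room, cabinet, shelf, disk) with per-node weights; the sketch only shows the one-level constraint.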
Byte,
Johannes
--
Johannes Thumshirn Storage
jthumshirn@suse.de +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
* Re: [LSF/MM TOPIC] De-clustered RAID with MD
2018-01-31 9:58 ` [Lsf-pc] " David Brown
2018-01-31 10:58 ` Johannes Thumshirn
@ 2018-01-31 14:27 ` Wols Lists
2018-01-31 14:41 ` David Brown
1 sibling, 1 reply; 14+ messages in thread
From: Wols Lists @ 2018-01-31 14:27 UTC (permalink / raw)
To: David Brown, NeilBrown, Johannes Thumshirn, lsf-pc
Cc: linux-raid, linux-block, Hannes Reinecke, Neil Brown
On 31/01/18 09:58, David Brown wrote:
> I would also be interested in how the data and parities are distributed
> across cabinets and disk controllers. When you manually build from
> smaller raid sets, you can ensure that in each set the data disks and the
> parity are all in different cabinets - that way if an entire cabinet
> goes up in smoke, you have lost one drive from each set, and your data
> is still there. With a pseudo random layout, you have lost that. (I
> don't know how often entire cabinets of disks die, but I once lost both
> disks of a raid1 mirror when the disk controller card died.)
The more I think about how I plan to spec raid-61, the more a modulo
approach seems to make sense. That way, it'll be fairly easy to predict
what ends up where, and make sure your disks are evenly scattered.
I think both your and my approach might have problems with losing an
entire cabinet, however. Depends on how many drives per cabinet ...
Anyways, my second thoughts are ...
We have what I will call a stripe-block: the lowest common multiple of
the "disks needed" (i.e. the number of mirrors times the number of
drives in the raid-6) and the disks available.
Assuming my blocks are all stored sequentially I can then quickly
calculate their position in this stripe-block. But this will fall foul
of just hammering the drives nearest to the failed drive. But if I
pseudo-randomise this position with "position * prime mod drives" where
"prime" is not common to either the number of drives or the number of
mirrors or the number of raid-drives, then this should achieve my aim of
uniquely shuffling the location of all the blocks without collisions.
Pretty simple maths, for efficiency, that smears the data over all the
drives. Does that sound feasible? All the heavy lifting, calculating the
least common multiple, finding the prime, etc etc can be done at array
set-up time.
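As a quick sanity check of that arithmetic (a sketch only, not an md implementation; the drive count and prime are picked for illustration): "position * prime mod drives" is collision-free exactly when the prime shares no factor with the number of drives.

```python
from math import gcd

drives = 18          # disks available (hypothetical)
prime = 5            # shares no factor with `drives`
assert gcd(prime, drives) == 1

def physical(position):
    """Smear sequential block positions across the drives."""
    return (position * prime) % drives

# Any `drives` consecutive positions hit every drive exactly once,
# so the shuffle is a collision-free permutation.
assert sorted(physical(p) for p in range(drives)) == list(range(drives))
```

With gcd(prime, drives) > 1 the map would collapse onto a subset of the drives, which is the blow-up mentioned later in the thread.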
(If this then allows feasible 100-drive arrays, we won't just need an
incremental assemble mode, we might need an incremental build mode :-)
Cheers,
Wol
* Re: [LSF/MM TOPIC] De-clustered RAID with MD
2018-01-31 14:27 ` Wols Lists
@ 2018-01-31 14:41 ` David Brown
0 siblings, 0 replies; 14+ messages in thread
From: David Brown @ 2018-01-31 14:41 UTC (permalink / raw)
To: Wols Lists, NeilBrown, Johannes Thumshirn, lsf-pc
Cc: linux-raid, linux-block, Hannes Reinecke, Neil Brown
On 31/01/18 15:27, Wols Lists wrote:
> On 31/01/18 09:58, David Brown wrote:
>> I would also be interested in how the data and parities are distributed
>> across cabinets and disk controllers. When you manually build from
>> smaller raid sets, you can ensure that in each set the data disks and the
>> parity are all in different cabinets - that way if an entire cabinet
>> goes up in smoke, you have lost one drive from each set, and your data
>> is still there. With a pseudo random layout, you have lost that. (I
>> don't know how often entire cabinets of disks die, but I once lost both
>> disks of a raid1 mirror when the disk controller card died.)
>
> The more I think about how I plan to spec raid-61, the more a modulo
> approach seems to make sense. That way, it'll be fairly easy to predict
> what ends up where, and make sure your disks are evenly scattered.
>
> I think both your and my approach might have problems with losing an
> entire cabinet, however. Depends on how many drives per cabinet ...
Exactly. I don't know how many cabinets are used on such systems.
>
> Anyways, my second thoughts are ...
>
> We have what I will call a stripe-block: the lowest common multiple of
> the "disks needed" (i.e. the number of mirrors times the number of
> drives in the raid-6) and the disks available.
>
> Assuming my blocks are all stored sequentially I can then quickly
> calculate their position in this stripe-block. But this will fall foul
> of just hammering the drives nearest to the failed drive. But if I
> pseudo-randomise this position with "position * prime mod drives" where
> "prime" is not common to either the number of drives or the number of
> mirrors or the number of raid-drives, then this should achieve my aim of
> uniquely shuffling the location of all the blocks without collisions.
>
> Pretty simple maths, for efficiency, that smears the data over all the
> drives. Does that sound feasible? All the heavy lifting, calculating the
> least common multiple, finding the prime, etc etc can be done at array
> set-up time.
Something like that should work, and be convenient to implement. I am
not sure off the top of my head if such a simple modulo system is valid,
but it won't be difficult to check.
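A brute-force check along those lines is indeed short (a hypothetical helper, just to illustrate the validity test, not proposed kernel code):

```python
from math import gcd, lcm

def check_shuffle(drives, prime, blocks):
    """Brute-force check that position*prime mod drives spreads `blocks`
    consecutive positions without collision: every window of `drives`
    consecutive positions must hit each drive exactly once."""
    if gcd(prime, drives) != 1:
        return False
    mapping = [(p * prime) % drives for p in range(blocks)]
    for start in range(0, blocks - drives + 1, drives):
        window = mapping[start:start + drives]
        if len(set(window)) != drives:
            return False
    return True

assert check_shuffle(drives=7, prime=3, blocks=35)
assert check_shuffle(drives=100, prime=13, blocks=lcm(100, 18))
assert not check_shuffle(drives=10, prime=5, blocks=30)   # 5 divides 10
```

Running such a check once at array set-up time would catch an invalid prime before any data is laid out.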
>
> (If this then allows feasible 100-drive arrays, we won't just need an
> incremental assemble mode, we might need an incremental build mode :-)
>
You really want to track which stripes are valid here, and which are not
yet made consistent. A blank array will start with everything marked
invalid or inconsistent - build mode is just a matter of writing the
metadata. You only need to make stripes consistent when you first write
to them.
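That lazy-initialisation idea can be sketched with a simple bitmap (a toy structure for illustration, not md's actual write-intent bitmap format):

```python
class Array:
    """Toy array that tracks per-stripe consistency lazily."""

    def __init__(self, stripes):
        # All zero at "build" time: nothing is consistent yet, and build
        # mode is just a matter of writing this metadata out.
        self.consistent = bytearray((stripes + 7) // 8)

    def _test(self, s):
        return bool(self.consistent[s // 8] >> (s % 8) & 1)

    def write(self, stripe):
        if not self._test(stripe):
            # First write: compute parity for the whole stripe, then
            # mark it consistent.
            self.consistent[stripe // 8] |= 1 << (stripe % 8)
        # ... normal read-modify-write path would follow here ...

a = Array(1000)
assert not a._test(42)
a.write(42)
assert a._test(42)
```

Reads of never-written stripes can then return zeros without touching the disks at all.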
* Re: [LSF/MM TOPIC] De-clustered RAID with MD
2018-01-30 11:24 ` NeilBrown
2018-01-30 17:40 ` Wol's lists
@ 2018-02-03 15:53 ` Wols Lists
2018-02-03 17:16 ` Wols Lists
2 siblings, 0 replies; 14+ messages in thread
From: Wols Lists @ 2018-02-03 15:53 UTC (permalink / raw)
To: NeilBrown, Johannes Thumshirn, lsf-pc
Cc: linux-raid, linux-block, Hannes Reinecke, Neil Brown
On 30/01/18 11:24, NeilBrown wrote:
> When Ax are a logical stripe and Bx are the next, then you have a
> slightly better distribution. If device XX fails then the reads needed
> for the first stripe mostly come from different drives than those for
> the second stripe, which are mostly different again for the 3rd stripe.
Well, my maths has been well and truly proved wrong. But. Does streaming
from one drive directly onto another have value in reducing stress? I've
just done a worked example for four drives, mirrored three times across
eighteen drives.
The result is that the drives have ended up in six groups of three
mirrored drives. So if any drive fails, I can just copy an
almost-identical drive from the mirror.
I'll have to play with eg 17 drives and see if this alters the dynamics
- it'll be interesting.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
a1 b2 c3 a4 c1 a2 b3 c4 b1 c2 a3 b4 a1 b2 c3 a4 c1 a2
b3 c4 b1 c2 a3 b4 a1 b2 c3 a4 c1 a2 b3 c4 b1 c2 a3 b4
Note I'm rotating 5 drives at a time, so a1,a2,a3,a4 goes on drives
0,5,10,15 ... But pick any drive and you'll notice the drives six away
are almost mirrors.
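The layout can be reproduced programmatically. This is a hypothetical reconstruction (the a/b/c mirror labels may not line up with the table above exactly), but the structural claim, that drives six apart hold the same blocks, comes out the same:

```python
# 4 logical blocks (a1..a4), three mirrors, smeared over 18 physical
# drives with a stride of 5, tiling a repeating 12-slot pattern.
logical = 4
mirrors = 3
drives = 18
period = logical * mirrors          # 12 distinct blocks per pattern

# Copy m of block k lands at slot (5*k + 4*m) mod 12 within the pattern.
pattern = [None] * period
for k in range(logical):
    for m in range(mirrors):
        pattern[(5 * k + 4 * m) % period] = ("abc"[m], k + 1)

# The 12-slot pattern tiles the 2-row x 18-drive grid.
grid = [[pattern[(row * drives + d) % period] for d in range(drives)]
        for row in range(2)]

# Drives six apart hold the same pair of blocks, i.e. they are mirrors,
# giving six groups of three mirrored drives (d, d+6, d+12).
for d in range(drives):
    assert ({grid[0][d], grid[1][d]}
            == {grid[0][(d + 6) % drives], grid[1][(d + 6) % drives]})
```

The grouping falls out of the arithmetic: 18 mod 12 = 6, so the second row is the pattern shifted by six, pairing each drive with the one six away.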
Cheers,
Wol
* Re: [LSF/MM TOPIC] De-clustered RAID with MD
2018-01-30 11:24 ` NeilBrown
2018-01-30 17:40 ` Wol's lists
2018-02-03 15:53 ` Wols Lists
@ 2018-02-03 17:16 ` Wols Lists
2 siblings, 0 replies; 14+ messages in thread
From: Wols Lists @ 2018-02-03 17:16 UTC (permalink / raw)
To: NeilBrown, Johannes Thumshirn, lsf-pc
Cc: linux-raid, linux-block, Hannes Reinecke, Neil Brown
On 30/01/18 11:24, NeilBrown wrote:
> When Ax are a logical stripe and Bx are the next, then you have a
> slightly better distribution. If device XX fails then the reads needed
> for the first stripe mostly come from different drives than those for
> the second stripe, which are mostly different again for the 3rd stripe.
Having got myself thoroughly confused, I've now got my head straight.
There's actually an incredibly simple algorithm for this, I think.
First of all, let's state that we're defining a new complex raid here,
people get confused between linux-raid-10 and raid-1+0, so ...
I was originally thinking about raid-61, mirroring raid-6 across a
random number of drives. I think here we're actually talking about
raid-60 - doing a raid-6 across a random number of drives. Can we prefix
whatever raid level we call it with an "L" so it's not confused with
standard raid-6+0?
Okay. Let's say we have a five-disk raid-6 smeared across seven drives.
Let's call it Raid-L60. That gives us 35 blocks per stripe-block.
A1 A2 A3 A4 A5 B1 B2
B3 B4 B5 C1 C2 C3 C4
C5 D1 D2 D3 D4 D5 E1
E2 E3 E4 E5 F1 F2 F3
F4 F5 G1 G2 G3 G4 G5
That gives us a logical disk stripe 35 wide, that we just need to scramble.
Position * Prime mod Disks.
The problem, of course, is that this is going to scramble each logical
stripe the same. So as we've got 5 stripes, can we use a table of 5
different primes? The snag, of course, is that we mustn't use a prime
that is a factor of the number of disks, as this will blow up with a
vengeance. It makes sense also to avoid a prime that's a factor of the
logical disks, I suspect. So here, I'd use 2, 3, 11, 13, 17.
I'd like to add the logical stripe number in rather than use multiple
primes, but I don't think that'll work ...
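Putting the per-row primes together (a sketch using only the numbers from this example; this is illustration, not an md implementation):

```python
from math import gcd

drives, width, rows = 7, 5, 5        # 7 physical drives, 5-disk raid-6
primes = [2, 3, 11, 13, 17]          # one per row; none divide 7 or 5
assert all(gcd(p, drives) == 1 for p in primes)

# Sequential layout: 35 blocks A1..G5 laid out row by row across 7 drives,
# as in the table above.
names = [f"{chr(ord('A') + b // width)}{b % width + 1}" for b in range(35)]

# Scramble each row with its own prime so the rows don't all permute alike.
scrambled = []
for row in range(rows):
    src = names[row * drives:(row + 1) * drives]
    dst = [None] * drives
    for col in range(drives):
        dst[(col * primes[row]) % drives] = src[col]
    scrambled.append(dst)

# Every row is a collision-free permutation of its sequential layout.
assert all(None not in r for r in scrambled)
```

Because each prime is coprime to the drive count, no row ever maps two blocks to the same drive, and the differing primes keep consecutive rows from hammering the same neighbours during a rebuild.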
Cheers,
Wol
end of thread, other threads:[~2018-02-03 17:16 UTC | newest]
Thread overview: 14+ messages
2018-01-29 15:23 [LSF/MM TOPIC] De-clustered RAID with MD Johannes Thumshirn
2018-01-29 16:32 ` Wols Lists
2018-01-29 21:50 ` [Lsf-pc] " NeilBrown
2018-01-30 10:43 ` Wols Lists
2018-01-30 11:24 ` NeilBrown
2018-01-30 17:40 ` Wol's lists
2018-02-03 15:53 ` Wols Lists
2018-02-03 17:16 ` Wols Lists
2018-01-31 9:58 ` [Lsf-pc] " David Brown
2018-01-31 10:58 ` Johannes Thumshirn
2018-01-31 14:27 ` Wols Lists
2018-01-31 14:41 ` David Brown
2018-01-30 9:40 ` [Lsf-pc] " Johannes Thumshirn
2018-01-31 8:03 ` David Brown